APPARATUS AND METHOD FOR COMBINING GEOMETRY-BASED SPATIAL AUDIO CODING STREAMS
Abstract:
An apparatus and method for combining geometry-based spatial audio coding streams. An apparatus for generating a combined audio data stream is provided. The apparatus comprises a demultiplexer (180) for obtaining a plurality of single-layer audio data streams, wherein the demultiplexer (180) is adapted to receive one or more input audio data streams, wherein each input audio data stream comprises one or more layers, and wherein the demultiplexer (180) is adapted to demultiplex each input audio data stream having one or more layers into two or more demultiplexed audio data streams having exactly one layer, such that the two or more demultiplexed audio data streams together comprise the one or more layers of the input audio data stream. Further, the apparatus comprises a combining module (190) for generating the combined audio data stream, having one or more layers, based on the plurality of single-layer audio data streams. Each layer of the input audio data streams, of the demultiplexed audio data streams, of the single-layer data streams and of the combined audio data stream comprises, as audio data, a pressure value of a pressure signal, a position value and a diffusion value.
Publication number: BR112014013336B1
Application number: R112014013336-0
Filing date: 2012-11-30
Publication date: 2021-08-24
Inventors: Giovanni Del Galdo; Thiergart Oliver; Herre Jürgen; Küch Fabian; Habets Emanuel; Craciun Alexandra; Kuntz Achim
Applicant: Fraunhofer-Gesellschaft Zur Forderung Der Angewandten Forschung E.V.
DESCRIPTION
[0001] The present invention relates to audio processing and, in particular, to an apparatus and a method for generating a combined audio data stream.
[0002] Audio processing and, in particular, spatial audio coding are becoming increasingly important. Traditional spatial sound recording aims to capture a sound field so that, on the reproduction side, a listener perceives the sound image as if it were at the recording location. Different approaches to spatial sound recording and reproduction techniques are known from the state of the art, which can be based on channel, object or parametric representations.
[0003] Channel-based representations represent the sound scene by means of N discrete audio signals to be reproduced by speakers arranged in a known configuration, for example, a 5.1 surround sound configuration. The spatial sound recording approach generally employs spaced omnidirectional microphones, for example, in AB stereophony, or coincident directional microphones, for example, in intensity stereophony. Alternatively, more sophisticated microphones, such as a B-format microphone, can be used, for example, in Ambisonics, see:
[0004] [1] Michael A. Gerzon. Ambisonics in multichannel broadcasting and video. J. Audio Eng. Soc, 33(11):859-871, 1985.
[0005] The desired speaker signals for the known setup are derived directly from the recorded microphone signals and are then transmitted or stored discretely. A more efficient representation is obtained by applying audio coding to the discrete signals, which in some cases encodes the information of different channels jointly for increased efficiency, for example, MPEG Surround for 5.1, see:
[0006] [21] J. Herre, K. Kjörling, J. Breebaart, C. Faller, S. Disch, H. Purnhagen, J. Koppens, J. Hilpert, J. Rödén, W. Oomen, K. Linzmeier, K. S. Chong: "MPEG Surround - The ISO/MPEG Standard for Efficient and Compatible Multichannel Audio Coding", 122nd AES Convention, Vienna, Austria, 2007, Preprint 7084.
[0007] An important disadvantage of these techniques is that the sound scene, once the speaker signals have been computed, cannot be modified.
[0008] Object-based representations are, for example, used in Spatial Audio Object Coding (SAOC | Spatial Audio Object Coding), see
[0009] [25] Jeroen Breebaart, Jonas Engdegârd, Cornelia Falch, Oliver Hellmuth, Johannes Hilpert, Andreas Hoelzer, Jeroen Koppens, Werner Oomen, Barbara Resch, Erik Schuijers, and Leonid Terentiev. Spatial audio object coding (SAOC) - the upcoming MPEG standard on parametric object based audio coding. In Audio Engineering Society Convention 124, May 2008.
[00010] Object-based representations represent the sound scene with N discrete audio objects. This representation provides high flexibility on the reproduction side, as the sound scene can be manipulated by changing, for example, the position and loudness of each object. Although this representation can be readily available from, for example, a multitrack recording, it is very difficult to obtain from a complex sound scene recorded with a few microphones (see, for example, [21]). In fact, talkers (or other sound-emitting objects) must first be localized and then extracted from the mix, which can cause artifacts.
[00011] Parametric representations generally employ spatial microphones to determine one or more downmix audio signals along with side information describing the spatial sound. One example is Directional Audio Coding (DirAC | Directional Audio Coding), as discussed in
[00012] [29] Ville Pulkki.
Spatial sound reproduction with directional audio coding. J. Audio Eng. Soc, 55(6):503-516, June 2007.
[00013] The term "spatial microphone" refers to any apparatus for the acquisition of spatial sound capable of retrieving the direction of arrival of the sound (e.g., a combination of directional microphones, microphone arrays, etc.).
[00014] The term "non-spatial microphone" refers to any device that is not adapted to retrieve the direction of arrival of the sound, such as a single omnidirectional or directive microphone.
[00015] Another example is proposed in:
[00016] [4] C. Faller. Microphone front-ends for spatial audio coders. In Proc. of the AES 125th International Convention, San Francisco, Oct. 2008.
[00017] In DirAC, the spatial cue information comprises the direction of arrival (DOA) of the sound and the diffusion of the sound field, computed in a time-frequency domain. For sound reproduction, the audio reproduction signals can be derived based on the parametric description. These techniques offer great flexibility on the reproduction side, as an arbitrary speaker configuration can be employed, as the representation is particularly flexible and compact, comprising a mono audio downmix signal and side information, and because it allows easy scene modifications, e.g., acoustic zoom, directional filtering, scene merging, etc.
[00018] However, these techniques are still limited by the fact that the recorded spatial image is always relative to the spatial microphone used. Thus, the acoustic point of view cannot be varied and the listening position within the sound scene cannot be changed.
[00019] A virtual microphone approach is presented in
[00020] [22] Giovanni Del Galdo, Oliver Thiergart, Tobias Weller, and E.A.P. Habets. Generating virtual microphone signals using geometrical information gathered by distributed arrays. In Third Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA'11), Edinburgh, United Kingdom, May 2011.
[00021] which allows the output signals of an arbitrary spatial microphone virtually placed (i.e., with arbitrary position and orientation) in the environment to be calculated. The flexibility that characterizes the virtual microphone (VM | virtual microphone) approach allows the sound scene to be virtually captured in a post-processing step, but no sound field representation becomes available that can be used to transmit and/or store and/or modify the sound scene efficiently. Furthermore, only one source per time-frequency position is assumed to be active, and thus the approach cannot correctly describe the sound scene if two or more sources are active at the same time-frequency position. Also, if the virtual microphone (VM) is applied on the receiver side, all microphone signals need to be sent over the channel, which makes the representation inefficient, whereas if the VM is applied on the transmitter side, the sound scene can no longer be manipulated and the model loses flexibility and becomes limited to a particular speaker configuration. Moreover, it does not consider sound scene manipulation based on parametric information.
[00022] In
[00023] [24] Emmanuel Gallo and Nicolas Tsingos. Extracting and re-rendering structured auditory scenes from field recordings. In AES 30th International Conference on Intelligent Audio Environments, 2007,
[00024] the estimation of the sound source position is based on pairwise time differences of arrival measured by means of distributed microphones.
Furthermore, the receiver is dependent on the recording and requires all microphone signals for the synthesis (e.g., the generation of the speaker signals).
[00025] The method presented in
[00026] [28] Svein Berge, Device and method for converting spatial audio signal. US patent application, Appl. No. 10/547,151,
[00027] uses, similarly to DirAC, the direction of arrival as a parameter, thus limiting the representation to a specific point of view of the sound scene. Moreover, it does not propose the possibility of transmitting/storing the representation of the sound scene, since the analysis and the synthesis need to be applied on the same side of the communication system.
[00028] Another example can be video conferencing applications, in which the parties being recorded in different environments need to be reproduced in a single sound scene. A Multipoint Control Unit (MCU | Multipoint Control Unit) must make sure that a single sound scene is reproduced.
[00029] In
[00030] [22] G. Del Galdo, F. Kuech, M. Kallinger, and R. Schultz-Amling. Efficient merging of multiple audio streams for spatial sound reproduction in directional audio coding. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2009), 2009
[00031] and in
[00032] [23] US 20110216908: Apparatus for Merging Spatial Audio Streams
[00033] the idea of combining two or more parametric representations of a sound scene was proposed.
[00034] However, it would be highly beneficial if concepts were provided to create a single sound scene from two or more sound scene representations in a way that is efficient and flexible enough to allow the sound scene to be modified.
[00035] The aim of the present invention is to provide improved concepts for generating a combined audio data stream, for example, a GAC stream. The object of the present invention is achieved by an apparatus according to claim 1, by a method according to claim 17 and by a computer program according to claim 18.
[00036] According to an application, an apparatus for generating a combined audio data stream is provided. The apparatus comprises a demultiplexer for obtaining a plurality of single-layer audio data streams, wherein the demultiplexer is adapted to receive one or more input audio data streams, wherein each input audio data stream comprises one or more layers, and wherein the demultiplexer is adapted to demultiplex each of the input audio data streams having one or more layers into two or more demultiplexed audio data streams having exactly one layer, such that the two or more demultiplexed audio data streams together comprise the one or more layers of the input audio data stream, thereby providing two or more of the single-layer audio data streams. Further, the apparatus comprises a combining module for generating the combined audio data stream, having one or more layers, based on the plurality of single-layer audio data streams, for example, based on the plurality of demultiplexed single-layer audio data streams. Each layer of the input audio data streams, of the demultiplexed audio data streams, of the single-layer data streams and of the combined audio data stream comprises, as audio data, a pressure value of a pressure signal, a position value and a diffusion value.
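The following is a minimal sketch (in Python, not part of the patent text) of the audio data carried per layer of such a stream and of the demultiplexing step described above; the array shapes, the class and function names, and the choice of three-dimensional position vectors are illustrative assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GacLayer:
    """One layer of a GAC-style stream over a K x N time-frequency grid."""
    pressure: np.ndarray   # complex pressure values, shape (K, N)
    position: np.ndarray   # position value per time-frequency bin, shape (K, N, 3)
    diffusion: np.ndarray  # diffusion value per time-frequency bin, shape (K, N), in [0, 1]

@dataclass
class GacStream:
    layers: list           # list of GacLayer; a single-layer stream has exactly one entry

def demultiplex(input_streams):
    """Split every input stream into streams having exactly one layer each."""
    single_layer_streams = []
    for stream in input_streams:
        for layer in stream.layers:
            single_layer_streams.append(GacStream(layers=[layer]))
    return single_layer_streams
```

The single-layer streams obtained in this way would then be fed to the combining module described in the following paragraphs.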
[00037] In another application, the apparatus may comprise a demultiplexer for obtaining a plurality of single-layer audio data streams, wherein the demultiplexer is adapted to receive two or more input audio data streams, wherein each input audio data stream comprises one or more layers, and wherein the demultiplexer is adapted to demultiplex each of the input audio data streams having two or more layers into two or more demultiplexed audio data streams having exactly one layer, such that the two or more demultiplexed audio data streams together comprise the two or more layers of the input audio data stream, to obtain two or more of the single-layer audio data streams. Further, the apparatus may comprise a combining module for generating the combined audio data stream, having one or more layers, based on the plurality of single-layer audio data streams.
[00038] In an application, the apparatus can be adapted to feed one or more input audio data streams having exactly one layer directly into the combining module without feeding them into the demultiplexer.
[00039] Each layer of the input audio data streams, of the demultiplexed audio data streams, of the single-layer data streams and of the combined audio data stream comprises a pressure value of a pressure signal, a position value and a diffusion value as audio data, the audio data being defined for a time-frequency position of a plurality of time-frequency positions.
[00040] According to this application, two or more recorded sound scenes are combined into one by combining two or more audio data streams, e.g., GAC streams, and by outputting a single audio data stream, for example, a single GAC stream.
[00041] Combining sound scenes can be used, for example, in video conferencing applications, in which parties being recorded in different environments need to be played back in a single sound scene. The combination can then take place in a Multipoint Control Unit (MCU | Multipoint Control Unit) to reduce network traffic or, at the end users, to reduce the computational cost of the synthesis (e.g., computing the loudspeaker signals).
[00042] In an application, the combining module can comprise a cost function module to assign a cost value to each of the single-layer audio data streams, and the combining module can be adapted to generate the combined audio data stream based on the cost values assigned to the single-layer audio data streams.
[00043] According to another application, the cost function module can be adapted to assign the cost value to each of the single-layer audio data streams depending on at least one of the pressure values or the diffusion values of the single-layer audio data stream.
[00044] In another application, the cost function module can be adapted to assign a cost value to each audio data stream of the group of single-layer audio data streams by applying the formula:
[00045] fi = (1 - Φi) · |Pi|²,
wherein Pi is the pressure value and Φi is the diffusion value of the layer of the i-th audio data stream of the group of single-layer audio data streams, for example, for each time-frequency position.
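As an illustration of the cost function of paragraphs [00042] to [00045] and of the pressure combination described in the following paragraph, the sketch below ranks the single-layer streams of one time-frequency position by their cost value, keeps the most prominent ones as separate layers and merges the rest into one additional layer; the direct-power cost, the parameter keep_n and the use of a plain sum as the "combination" of the remaining pressures are assumptions made for illustration only.

```python
import numpy as np

def cost_value(pressure, diffusion):
    """Cost per time-frequency position: direct sound power (assumed weighting)."""
    return (1.0 - diffusion) * np.abs(pressure) ** 2

def combine_pressures(bins, keep_n):
    """bins: list of (pressure, position, diffusion) tuples for one time-frequency
    position, one entry per single-layer stream; keep_n: number of streams whose
    pressure value is kept as a separate layer of the combined stream."""
    ranked = sorted(bins, key=lambda b: cost_value(b[0], b[2]), reverse=True)
    first_group, second_group = ranked[:keep_n], ranked[keep_n:]
    out_pressures = [b[0] for b in first_group]
    if second_group:
        # Merge the less prominent streams into one additional layer; summing the
        # pressure values is one simple choice for the "combination".
        out_pressures.append(sum(b[0] for b in second_group))
    return out_pressures
```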
[00046] According to another application, the combining module may further comprise a pressure combination unit, wherein the pressure combination unit can be adapted to determine a first group comprising one or more single-layer audio data streams of the plurality of single-layer audio data streams and to determine a second group comprising one or more different single-layer audio data streams of the plurality of single-layer audio data streams, wherein the cost value of each of the single-layer audio data streams of the first group may be greater than the cost value of each of the single-layer audio data streams of the second group, or wherein the cost value of each of the single-layer audio data streams of the first group may be less than the cost value of each of the single-layer audio data streams of the second group. The pressure combination unit may be adapted to generate one or more pressure values of one or more layers of the combined audio data stream, such that each pressure value of each of the single-layer audio data streams of the first group is a pressure value of one of the layers of the combined audio data stream, and such that a combination of the pressure values of the single-layer audio data streams of the second group is a pressure value of one of the layers of the combined audio data stream.
[00047] In another application, the combining module may further comprise a diffusion combination unit, wherein the diffusion combination unit can be adapted to determine a third group comprising one or more single-layer audio data streams of the plurality of single-layer audio data streams and to determine a fourth group comprising one or more different single-layer audio data streams of the plurality of single-layer audio data streams. The cost value of each of the single-layer audio data streams of the third group may be greater than the cost value of each of the single-layer audio data streams of the fourth group, or the cost value of each of the single-layer audio data streams of the third group may be less than the cost value of each of the single-layer audio data streams of the fourth group. The diffusion combination unit can be adapted to generate one or more diffusion values of one or more layers of the combined audio data stream, such that each diffusion value of each of the single-layer audio data streams of the third group is a diffusion value of one of the layers of the combined audio data stream, and such that a combination of the diffusion values of the single-layer audio data streams of the fourth group is a diffusion value of one of the layers of the combined audio data stream.
[00048] According to another application, the combining module may further comprise a position mixing unit (1403), wherein the position mixing unit (1403) can be adapted to determine a fifth group comprising one or more single-layer audio data streams of the plurality of single-layer audio data streams, wherein the cost value of each of the single-layer audio data streams of the fifth group may be greater than the cost value of any single-layer audio data stream not comprised in the fifth group of the plurality of single-layer audio data streams, or wherein the cost value of each of the single-layer audio data streams of the fifth group is less than the cost value of any single-layer audio data stream not comprised in the fifth group of the plurality of single-layer audio data streams.
The position mixing unit (1403) can be adapted to generate one or more position values of one or more layers of the combined audio data stream, such that each position value of each of the single-layer audio data streams of the fifth group is a position value of one of the layers of the combined audio data stream.
[00049] In another application, the combining module may further comprise a sound scene adaptation module to manipulate the position value of one or more of the single-layer audio data streams of the plurality of single-layer audio data streams.
[00050] According to another application, the sound scene adaptation module can be adapted to manipulate the position value of one or more of the single-layer audio data streams of the plurality of single-layer audio data streams by applying a rotation, a translation or a non-linear transformation to the position value.
[00051] In another application, the demultiplexer can comprise a plurality of demultiplexing units, wherein each of the demultiplexing units can be configured to demultiplex one or more of the input audio data streams.
[00052] According to another application, the apparatus may further comprise an artificial sound source generator to generate an artificial data stream comprising exactly one layer, wherein the artificial source generator can be adapted to receive pressure information represented in a time domain and to receive position information, wherein the artificial source generator may be adapted to replicate the position information to generate position values for a plurality of time-frequency positions, and wherein the artificial source generator may further be adapted to calculate diffusion information based on the pressure information.
[00053] In another application, the artificial source generator can be adapted to transform the pressure information represented in a time domain into a time-frequency domain.
[00054] According to another application, the artificial source generator can be adapted to add reverberation to the pressure information.
[00055] Another application allows an artificial sound source to be inserted into the sound scene. Inserting an artificial sound source is particularly useful in virtual reality and in applications such as video games, where a complex sound scene can be enriched with synthetic sources. In teleconferencing scenarios, the insertion is useful when combining parties that communicate through a single channel, for example, parties dialing in by telephone.
[00056] Preferred applications of the present invention will be described below, in which:
[00057] Fig. 1 illustrates an apparatus for generating a combined audio data stream according to an application,
[00058] Fig. 2a illustrates an apparatus for generating at least one audio output signal based on an audio data stream comprising audio data relating to one or more sound sources according to an application,
[00059] Fig. 2b illustrates an apparatus for generating an audio data stream comprising sound source data relating to one or more sound sources according to an application,
[00060] Figs. 3a-3c illustrate audio data streams according to different applications,
[00061] Fig. 4 illustrates an apparatus for generating an audio data stream comprising sound source data relating to one or more sound sources according to another application,
[00062] Fig. 5 illustrates a sound scene composed of two sound sources and two uniform linear microphone arrays,
[00063] Fig. 6a illustrates an apparatus 600 for generating at least one audio output signal based on an audio data stream according to an application,
[00064] Fig. 6b illustrates an apparatus 660 for generating an audio data stream comprising sound source data relating to one or more sound sources according to an application,
[00065] Fig. 7 describes a modification module according to an application,
[00066] Fig. 8 describes a modification module according to another application,
[00067] Fig. 9 illustrates transmitter/analysis units and receiver/synthesis units according to an application,
[00068] Fig. 10a describes a synthesis module according to an application,
[00069] Fig. 10b describes a first unit of the synthesis stage according to an application,
[00070] Fig. 10c describes a second unit of the synthesis stage according to an application,
[00071] Fig. 11 describes a synthesis module according to another application,
[00072] Fig. 12 illustrates an apparatus for generating an audio output signal of a virtual microphone according to an application,
[00073] Fig. 13 illustrates the inputs and outputs of an apparatus and a method for generating an audio output signal of a virtual microphone according to an application,
[00074] Fig. 14 illustrates the basic structure of an apparatus for generating an audio output signal of a virtual microphone according to an application, comprising a sound events position evaluator and an information computing module,
[00075] Fig. 15 shows an exemplary scenario in which the real spatial microphones are depicted as Uniform Linear Arrays of 3 microphones each,
[00076] Fig. 16 depicts two spatial microphones in 3D for estimating the direction of arrival in 3D space,
[00077] Fig. 17 illustrates a geometry where an isotropic point-like sound source of the current time-frequency position (k, n) is located at a position pIPLS(k, n),
[00078] Fig. 18 describes the information computing module according to an application,
[00079] Fig. 19 describes the information computing module according to another application,
[00080] Fig. 20 shows two real spatial microphones, a localized sound event and the position of a virtual spatial microphone,
[00081] Fig. 21 illustrates how to obtain the direction of arrival relative to a virtual microphone according to an application,
[00082] Fig. 22 describes a possible way to derive the DOA of the sound from the point of view of the virtual microphone according to an application,
[00083] Fig. 23 illustrates an information computation block comprising a diffusion computation unit according to an application,
[00084] Fig. 24 describes a diffusion computation unit according to an application,
[00085] Fig. 25 illustrates a scenario where estimating the position of sound events is not possible,
[00086] Fig. 26 illustrates an apparatus for generating a virtual microphone data stream according to an application,
[00087] Fig. 27 illustrates an apparatus for generating at least one audio output signal based on an audio data stream according to another application,
[00088] Fig. 28 describes the inputs and outputs of an apparatus for generating a combined audio data stream according to another application,
[00089] Fig. 29 illustrates an apparatus for generating a combined audio data stream according to another application,
[00090] Fig. 30 describes a combining module according to an application,
[00091] Figs. 31a-31c depict possible sound scenes, and
[00092] Figs. 32a-32b illustrate artificial source generators according to applications.
[00093] Figs. 33a-33c illustrate scenarios where two microphone arrays receive direct sound, sound reflected from a wall, and diffuse sound.
[00094] Before providing a detailed description of the applications of the present invention, an apparatus for generating an audio output signal of a virtual microphone is described to provide background information regarding the concepts of the present invention.
[00095] Figure 12 illustrates an apparatus for generating an audio output signal to simulate a recording by a microphone at a configurable virtual position posVmic in an environment. The apparatus comprises a sound events position evaluator 110 and an information computing module 120. The sound events position evaluator 110 receives a first direction information di1 from a first real spatial microphone and a second direction information di2 from a second real spatial microphone. The sound events position evaluator 110 is adapted to estimate a sound source position ssp indicating a position of a sound source in the environment, the sound source emitting a sound wave, wherein the sound events position evaluator 110 is adapted to estimate the sound source position ssp based on the first direction information di1 provided by the first real spatial microphone located at a first real microphone position pos1mic in the environment, and based on the second direction information di2 provided by the second real spatial microphone located at a second real microphone position in the environment. The information computing module 120 is adapted to generate the audio output signal based on a first recorded audio input signal is1 recorded by the first real spatial microphone, based on the first real microphone position pos1mic and based on the virtual position posVmic of the virtual microphone. The information computing module 120 comprises a propagation compensator adapted to generate a first modified audio signal by modifying the first recorded audio input signal is1, by compensating for a first delay or amplitude decline between an arrival of the sound wave emitted by the sound source at the first real spatial microphone and an arrival of the sound wave at the virtual microphone, by adjusting an amplitude value, a magnitude value or a phase value of the first recorded audio input signal is1, to obtain the audio output signal.
[00096] Figure 13 illustrates the inputs and outputs of an apparatus and a method according to an application. Information from two or more real spatial microphones 111, 112, ..., 11N is input to the apparatus/is processed by the method. This information comprises the audio signals picked up by the real spatial microphones as well as direction information from the real spatial microphones, e.g., direction of arrival (DOA) estimates. The audio signals and the direction information, such as the direction of arrival estimates, can be expressed in a time-frequency domain. If, for example, a 2D geometry reconstruction is desired and a traditional STFT (short-time Fourier transform) domain is chosen for the representation of the signals, the DOA can be expressed as azimuth angles dependent on k and n, namely the frequency and time indices.
[00097] In applications, the localization of the sound event in space, as well as the description of the position of the virtual microphone, can be conducted based on the positions and orientations of the virtual and real spatial microphones in a common coordinate system.
This information can be represented by the inputs 121...12N and the input 104 in Fig. 13. The input 104 can further specify the characteristics of the virtual spatial microphone, for example, its position and pick-up pattern, as will be discussed below. If the virtual spatial microphone comprises several virtual sensors, their positions and the corresponding different pick-up patterns can be considered.
[00098] The output of the apparatus or of a corresponding method can be, when desired, one or more sound signals 105, which may have been picked up by a spatial microphone defined and placed as specified by 104. Further, the apparatus (or the method) can provide as output corresponding spatial side information 106, which can be estimated using the virtual spatial microphone.
[00099] Figure 14 illustrates an apparatus according to an application, comprising two main processing units, a sound events position evaluator 201 and an information computing module 202. The sound events position evaluator 201 can perform the geometric reconstruction based on the DOAs comprised in the inputs 111...11N and based on the knowledge of the position and orientation of the real spatial microphones where the DOAs were calculated. The output 205 of the sound events position evaluator comprises the position estimates (both in 2D and in 3D) of the sound sources where the sound events occur for each time and frequency slice. The second processing block 202 is an information computing module. According to the application of Fig. 14, the second processing block 202 calculates a virtual microphone signal and spatial side information. It is therefore also referred to as the virtual microphone signal and side information computation block 202. The virtual microphone signal and side information computation block 202 uses the positions of the sound events 205 to process the audio signals comprised in 111...11N and to output the audio signal of the virtual microphone 105. Block 202, if requested, can also compute spatial side information 106 corresponding to the virtual spatial microphone. The applications below illustrate possibilities of how blocks 201 and 202 can operate.
[000100] In the following, the position estimation performed by a sound events position evaluator according to an application is described in more detail.
[000101] Depending on the dimensionality of the problem (2D or 3D) and on the number of spatial microphones, several solutions for the position estimation are possible.
[000102] If two 2D spatial microphones exist (the simplest possible case), a simple triangulation is possible. Figure 15 shows an exemplary scenario in which the real spatial microphones are depicted as Uniform Linear Arrays (ULAs) of 3 microphones each. The DOAs, expressed as the azimuth angles a1(k, n) and a2(k, n), are calculated for the time-frequency position (k, n). This is achieved by employing a suitable DOA evaluator, such as ESPRIT,
[000103] [13] R. Roy, A. Paulraj, and T. Kailath, "Direction-of-arrival estimation by subspace rotation methods - ESPRIT," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Stanford, CA, USA, April 1986,
[000104] or (root) MUSIC, see
[000105] [14] R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986
[000106] applied to the pressure signals transformed into the time-frequency domain.
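Before the detailed geometric derivation given further below (see equation (6)), the following minimal sketch illustrates the 2D triangulation for one time-frequency position: two arrays at known positions each provide an azimuth estimate, already expressed in the global coordinate system, and the sound event is placed at the intersection of the two rays. The function names, the tolerance used to detect parallel rays and the Python/NumPy formulation are assumptions made for illustration only.

```python
import numpy as np

def triangulate_2d(p1, azimuth1, p2, azimuth2):
    """p1, p2: array positions, shape (2,); azimuth1, azimuth2: DOA azimuths in
    radians, already expressed in the global coordinate system. Returns the
    estimated sound event position, or None when the rays are (almost) parallel."""
    e1 = np.array([np.cos(azimuth1), np.sin(azimuth1)])
    e2 = np.array([np.cos(azimuth2), np.sin(azimuth2)])
    # Solve p1 + d1 * e1 = p2 + d2 * e2 for the unknown distances d1, d2.
    A = np.column_stack((e1, -e2))
    if abs(np.linalg.det(A)) < 1e-9:
        return None  # triangulation fails: the two lines are (almost) parallel
    d1, _ = np.linalg.solve(A, p2 - p1)
    return p1 + d1 * e1

# Example: arrays at (0, 0) and (1, 0) m, both pointing at a source at (0.5, 1) m.
estimated = triangulate_2d(np.array([0.0, 0.0]), np.arctan2(1.0, 0.5),
                           np.array([1.0, 0.0]), np.arctan2(1.0, -0.5))
```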
[000107] In Figure 15, two real spatial microphones, here two real spatial microphone arrays 410, 420, are illustrated. The two estimated DOAs a1(k, n) and a2(k, n) are represented by two lines, a first line 430 representing the DOA a1(k, n) and a second line 440 representing the DOA a2(k, n). The triangulation is possible through simple arithmetic considerations, knowing the position and orientation of each array.
[000108] The triangulation fails when the two lines 430, 440 are exactly parallel. In real applications, however, this is very unlikely. However, not all triangulation results correspond to a physical or practical position for the sound event in the considered space. For example, the estimated position of the sound event may be too far away or even outside the assumed space, indicating that the DOAs probably do not correspond to any sound event that can be physically interpreted with the model used. Such results could be caused by sensor noise or very strong room reverberation. Thus, according to one application, these undesired results are marked so that the information computing module 202 can handle them correctly.
[000109] Figure 16 depicts a scenario where the position of a sound event is estimated in 3D space. Suitable spatial microphones are employed, for example, a planar or 3D microphone array. In Figure 16, a first spatial microphone 510, for example a first 3D microphone array, and a second spatial microphone 520, for example a second 3D microphone array, are illustrated. The DOA in 3D space can, for example, be expressed as azimuth and elevation. Unit vectors 530, 540 can be used to express the DOAs. Two lines 550, 560 are projected according to the DOAs. In 3D, even with very reliable estimates, the two lines 550, 560 projected according to the DOAs may not intersect. However, the triangulation can still be performed, for example, by choosing the midpoint of the smallest segment connecting the two lines.
[000110] Similarly to the 2D case, the triangulation may fail or may yield impractical results for certain combinations of directions, which can then be marked, for example, to the information computing module 202 of Figure 14.
[000111] If more than two spatial microphones exist, several solutions are possible. For example, the triangulation explained above could be performed for all pairs of real spatial microphones (if N = 3: 1 with 2, 1 with 3, and 2 with 3). The resulting positions can then be averaged (along x and y, and, if 3D is considered, z).
[000112] Alternatively, more complex concepts can be used. For example, probabilistic approaches can be applied as described in
[000113] [15] J. Michael Steele. "Optimal Triangulation of Random Samples in the Plane", The Annals of Probability, Vol. 10, No. 3 (Aug., 1982), pp. 548-553.
[000114] According to an application, the sound field can be analyzed in the time-frequency domain, for example, obtained through a short-time Fourier transform (STFT), in which k and n denote the frequency index and the time index, respectively. The complex pressure Pv(k, n) at an arbitrary position pv for a certain k and n is modeled as a single spherical wave emitted by a narrowband isotropic point-like source, for example, using the formula:
[000115] Pv(k, n) = PIPLS(k, n) · γ(k, pIPLS(k, n), pv),   (1)
where PIPLS(k, n) is the signal emitted by the IPLS at its position pIPLS(k, n). The complex factor γ(k, pIPLS, pv) expresses the propagation from pIPLS(k, n) to pv, i.e., it introduces appropriate phase and magnitude changes. Here, the assumption can be applied that in each time-frequency position only one IPLS is active.
Nevertheless, multiple narrowband IPLSs located at different positions can still be active at a single instant of time.
[000116] Each IPLS models the direct sound or a distinct room reflection. Its position pIPLS(k, n) can ideally correspond to an actual sound source located inside the room, or to a mirror-image sound source located outside, respectively. Thus, the position pIPLS(k, n) can also indicate the position of a sound event.
[000117] Please note that the term "actual sound sources" denotes the actual sound sources physically existing in the recording environment, such as talkers or musical instruments. In contrast, with "sound sources" or "sound events" or "IPLS" we refer to effective sound sources, which are active at certain instants of time or at certain time-frequency positions, where the sound sources can, for example, represent actual sound sources or mirror-image sources.
[000118] Figures 33a-33b illustrate microphone arrays localizing sound sources. The localized sound sources can have different physical interpretations depending on their nature. When the microphone arrays receive direct sound, they can localize the position of a true sound source (e.g., talkers). When the microphone arrays receive reflections, they can localize the position of a mirror-image source. Mirror-image sources are also sound sources.
[000119] Figure 33a illustrates a scenario where two microphone arrays 151 and 152 receive direct sound from an actual sound source (a physically existing sound source) 153.
[000120] Figure 33b illustrates a scenario where two microphone arrays 161, 162 receive reflected sound, the sound having been reflected by a wall. Because of the reflection, the microphone arrays 161, 162 localize the position from which the sound appears to come at the position of a mirror-image source 165, which is different from the position of the speaker 163.
[000121] Both the actual sound source 153 of Fig. 33a and the mirror-image source 165 are sound sources.
[000122] Figure 33c illustrates a scenario where two microphone arrays 171, 172 receive diffuse sound and cannot localize a sound source.
[000123] This single-wave model is accurate only for mildly reverberant environments, provided that the source signals fulfill the W-disjoint orthogonality (WDO) condition, i.e., the time-frequency overlap is sufficiently small. This is normally true for speech signals, see, for example,
[000124] [12] S. Rickard and Z. Yilmaz. "On the approximate W-disjoint orthogonality of speech," in Acoustics, Speech and Signal Processing, 2002. ICASSP 2002. IEEE International Conference on, April 2002, vol. 1.
[000125] However, the model also provides a good estimate for other environments and is therefore applicable to those environments as well.
[000126] Next, the estimation of the positions pIPLS(k, n) according to an application is explained. The position pIPLS(k, n) of an active IPLS in a certain time-frequency position, and thus the estimate of a sound event in a time-frequency position, is estimated via triangulation based on the direction of arrival (DOA) of the sound measured in at least two different observation points.
[000127] Figure 17 illustrates a geometry where the IPLS of the current time-frequency compartment (k, n) is located at the unknown position pIPLS(k, n). To determine the required DOA information, two real spatial microphones, here two microphone arrays, having a known geometry, position and orientation are employed, which are placed at positions 610 and 620, respectively.
Vectors p1 and p2 point at positions 610, 620, respectively. The array orientations are defined by the unit vectors c1 and c2. The DOA of the sound is determined at positions 610 and 620 for each (k, n) using a DOA estimation algorithm, for example, as provided by the DirAC analysis (see [2], [3]). Thereby, a first point-of-view unit vector e1POV(k, n) and a second point-of-view unit vector e2POV(k, n), relative to the point of view of the respective microphone array (both not shown in Figure 17), can be provided as output of the DirAC analysis. For example, when operating in 2D, the first point-of-view unit vector results in:
[000128] e1POV(k, n) = [cos(φ1(k, n)), sin(φ1(k, n))]T.   (2)
Here, φ1(k, n) represents the azimuth of the DOA estimated at the first microphone array, as depicted in Figure 17. The corresponding DOA unit vectors e1(k, n) and e2(k, n), with respect to the global coordinate system at the origin, can be calculated by applying the formula:
[000129] e1(k, n) = R1 · e1POV(k, n), e2(k, n) = R2 · e2POV(k, n),   (3)
where R are coordinate transformation matrices, for example,
[000130] R1 = [[c1,x, -c1,y], [c1,y, c1,x]]   (4)
when operating in 2D and c1 = [c1,x, c1,y]T. To perform the triangulation, the direction vectors d1(k, n) and d2(k, n) can be calculated as:
[000131] d1(k, n) = d1(k, n) e1(k, n), d2(k, n) = d2(k, n) e2(k, n),   (5)
where d1(k, n) = ||d1(k, n)|| and d2(k, n) = ||d2(k, n)|| are the unknown distances between the IPLS and the two microphone arrays. The following equation
[000132] p1 + d1(k, n) = p2 + d2(k, n)   (6)
can be solved for d1(k, n). Finally, the position pIPLS(k, n) of the IPLS is given by
[000133] pIPLS(k, n) = d1(k, n) e1(k, n) + p1.   (7)
In another application, equation (6) can be solved for d2(k, n), and pIPLS(k, n) is calculated analogously using d2(k, n).
[000134] Equation (6) always provides a solution when operating in 2D, unless e1(k, n) and e2(k, n) are parallel. However, when using more than two microphone arrays or when operating in 3D, a solution cannot be obtained when the direction vectors d do not intersect. According to an application, in this case, the point that is closest to all direction vectors d is calculated and the result can be used as the position of the IPLS.
[000135] In an application, all observation points p1, p2, ... should be located such that the sound emitted by the IPLS falls into the same time block n. This requirement can simply be fulfilled when the distance Δ between any two of the observation points is less than
[000136] Δmax = c · nFFT (1 - R) / fs,   (8)
where c is the speed of sound, nFFT is the length of the STFT window, 0 < R < 1 specifies the overlap between successive time frames and fs is the sampling frequency. For example, for a 1024-point STFT at 48 kHz with 50% overlap (R = 0.5), the maximum spacing between the arrays to fulfill the above requirement is Δ = 3.65 m.
[000137] In the following, an information computing module 202, for example a virtual microphone signal and side information computation module, according to an application is described in more detail.
[000138] Figure 18 illustrates a schematic overview of an information computing module 202 according to an application. The information computing unit comprises a propagation compensator 500, a combiner 510 and a spectral weighting unit 520. The information computing module 202 receives the sound source position estimates ssp estimated by a sound events position evaluator, one or more recorded audio input signals is, which are recorded by one or more of the real spatial microphones, the positions posRealMic of one or more of the real spatial microphones, and the virtual position posVmic of the virtual microphone. It outputs an audio output signal representing an audio signal of the virtual microphone.
[000139] Figure 19 illustrates an information computing module according to another application.
The information computing module of Fig. 19 comprises a propagation compensator 500, a combiner 510 and a spectral weighting unit 520. The propagation compensator 500 comprises a propagation parameter computing module 501 and a propagation compensation module 504. The combiner 510 comprises a combination factor computation module 502 and a combination module 505. The spectral weighting unit 520 comprises a spectral weight computation unit 503, a spectral weighting application module 506 and a spatial side information computation module 507.
[000140] To compute the audio signal of the virtual microphone, the geometric information, for example, the position and orientation of the real spatial microphones 121...12N, the position, orientation and characteristics of the virtual spatial microphone 104, and the position estimates of the sound events 205 are input to the information computing module 202, in particular, to the propagation parameter computing module 501 of the propagation compensator 500, to the combination factor computation module 502 of the combiner 510 and to the spectral weight computation unit 503 of the spectral weighting unit 520. The propagation parameter computing module 501, the combination factor computation module 502 and the spectral weight computation unit 503 calculate the parameters used to modify the audio signals 111...11N in the propagation compensation module 504, the combination module 505 and the spectral weighting application module 506.
[000141] In the information computing module 202, the audio signals 111...11N can first be modified to compensate for the effects caused by the different propagation lengths between the sound event positions and the real spatial microphones. The signals can then be combined to improve, for example, the signal-to-noise ratio (SNR). Finally, the resulting signal can then be spectrally weighted to take into account the directional pick-up pattern of the virtual microphone, as well as any distance-dependent gain function. These three steps are discussed in more detail below.
[000142] Propagation compensation is now explained in more detail. In the upper part of Figure 20, two real spatial microphones (a first microphone array 910 and a second microphone array 920), the position of a sound event 930 localized for the time-frequency position (k, n), and the position of the virtual spatial microphone 940 are illustrated.
[000143] The lower part of Figure 20 depicts a temporal axis. It is assumed that a sound event is emitted at time t0 and then propagates to the real and virtual spatial microphones. The arrival delays as well as the amplitudes change with distance: the longer the propagation length, the weaker the amplitude and the longer the arrival delay.
[000144] The signals at the two real arrays are comparable only if the relative delay Dt12 between them is small. Otherwise, one of the two signals needs to be temporally realigned to compensate for the relative delay Dt12, and possibly to be scaled to compensate for the different declines.
[000145] Compensating the delay between the arrival at the virtual microphone and the arrival at the real microphone arrays (at one of the real spatial microphones) changes the delay independently of the localization of the sound event, making it superfluous for most applications.
[000146] Returning to Figure 19, the propagation parameter computing module 501 is adapted to calculate the delays to be corrected for each real spatial microphone and for each sound event. If desired, it also calculates the gain factors to be considered to compensate for the different amplitude declines.
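A minimal sketch of how the per-event delay and gain of module 501 could be computed and then applied to an STFT bin (module 504) is given below; it assumes free-field 1/r decay of the sound pressure (cf. formula (12) further below) and a phase rotation for small delays. The speed of sound constant and all function and variable names are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value

def propagation_parameters(event_pos, real_mic_pos, virtual_mic_pos, fs):
    """Delay (in samples) and gain that map the signal observed at a real spatial
    microphone to the signal expected at the virtual microphone."""
    d_real = np.linalg.norm(np.asarray(event_pos) - np.asarray(real_mic_pos))
    d_virt = np.linalg.norm(np.asarray(event_pos) - np.asarray(virtual_mic_pos))
    delay_samples = (d_virt - d_real) / SPEED_OF_SOUND * fs  # > 0: virtual mic is farther away
    gain = d_real / max(d_virt, 1e-6)                        # 1/r decay of the sound pressure
    return delay_samples, gain

def compensate_stft_bin(pressure, delay_samples, gain, k, n_fft):
    """Apply the compensation to a single STFT bin as a gain plus phase rotation,
    adequate only when the delay is small compared to the STFT window."""
    phase = np.exp(-1j * 2.0 * np.pi * k * delay_samples / n_fft)
    return gain * phase * pressure
```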
[000147] The propagation compensation module 504 is configured to use this information to modify the audio signals accordingly. If the signals are to be shifted by a small amount of time (compared to the time window of the filter bank), then a simple phase rotation is sufficient. If the delays are larger, more complicated implementations are necessary.
[000148] The output of the propagation compensation module 504 is the modified audio signals expressed in the original time-frequency domain.
[000149] In the following, a particular estimate of the propagation compensation for a virtual microphone according to an application will be described with reference to Figure 17, which, among other things, illustrates the position 610 of a first real spatial microphone and the position 620 of a second real spatial microphone.
[000150] In the application that is now explained, it is assumed that at least a first recorded audio input signal, e.g., a pressure signal of at least one of the real spatial microphones (e.g., the microphone arrays), is available, for example, the pressure signal of a first real spatial microphone. We refer to the considered microphone as the reference microphone, to its position as the reference position pref and to its pressure signal as the reference pressure signal Pref(k, n). However, the propagation compensation can be conducted not only with respect to a single pressure signal, but also with respect to the pressure signals of a plurality of, or of all, real spatial microphones.
[000151] The relationship between the pressure signal PIPLS(k, n) emitted by the IPLS and a reference pressure signal Pref(k, n) of a reference microphone located at pref can be expressed by formula (9):
[000152] Pref(k, n) = PIPLS(k, n) · γ(k, pIPLS, pref).   (9)
In general, the complex factor γ(k, pa, pb) expresses the phase rotation and the amplitude decline introduced by the propagation of a spherical wave from its origin at pa to pb. However, practical tests indicated that considering only the amplitude decline in γ leads to plausible impressions of the virtual microphone signal with few significant artifacts compared to also considering the phase rotation.
[000153] The sound energy that can be measured at a certain point in space depends strongly on the distance r from the sound source, in Figure 17 the position pIPLS of the sound source. In many situations, this dependence can be modeled with sufficient accuracy using well-known physical principles, for example, the 1/r decay of the sound pressure in the far field of a point source. When the distance of a reference microphone, for example, the first real microphone, from the sound source is known, and when the distance of the virtual microphone from the sound source is also known, then the sound energy at the position of the virtual microphone can be estimated from the signal and the energy of the reference microphone, e.g., the first real spatial microphone. This means that the output signal of the virtual microphone can be obtained by applying appropriate gains to the reference pressure signal.
[000154] Assuming that the first real spatial microphone is the reference microphone, then pref = p1. In Figure 17, the virtual microphone is located at pv.
Since the geometry in Figure 17 is known in detail, the distance d1(k, n) = ||d1(k, n)|| between the reference microphone (in Figure 17: the first real spatial microphone) and the IPLS can easily be determined, as well as the distance s(k, n) = ||s(k, n)|| between the virtual microphone and the IPLS, namely
[000155] s(k, n) = ||s(k, n)|| = ||p1 + d1(k, n) - pv||.   (10)
The sound pressure Pv(k, n) at the position of the virtual microphone is calculated by combining formulas (1) and (9), leading to
[000156] Pv(k, n) = [γ(k, pIPLS, pv) / γ(k, pIPLS, pref)] · Pref(k, n).   (11)
As mentioned above, in some applications, the factors γ can only account for the amplitude decline due to the propagation. Assuming, for example, that the sound pressure decreases with 1/r, then
[000157] Pv(k, n) = [d1(k, n) / s(k, n)] · Pref(k, n).   (12)
When the model in formula (1) holds, for example when only direct sound is present, then formula (12) can accurately reconstruct the magnitude information. However, in the case of pure diffuse sound fields, i.e., when the model assumptions are not fulfilled, the presented method yields an implicit de-reverberation of the signal when the virtual microphone is moved away from the positions of the sensor arrays. In fact, as discussed above, in diffuse sound fields we expect most IPLS to be located close to the two sensor arrays. Thus, when moving the virtual microphone away from these positions, we probably increase the distance s = ||s|| in Figure 17. Thus, the magnitude of the reference pressure is reduced by applying a weighting according to formula (11). Correspondingly, when moving the virtual microphone close to an actual sound source, the time-frequency positions corresponding to the direct sound will be amplified, so that the overall audio signal will be perceived as less diffuse. By adjusting the rule in formula (12), one can control the amplification of the direct sound and the suppression of the diffuse sound at will.
[000158] By conducting propagation compensation on the recorded audio input signal (e.g., the pressure signal) of the first real spatial microphone, a first modified audio signal is obtained.
[000159] In applications, a second modified audio signal can be obtained by conducting propagation compensation on a second recorded audio input signal (a second pressure signal) of the second real spatial microphone.
[000160] In other applications, further audio signals can be obtained by conducting propagation compensation on further recorded audio input signals (further pressure signals) of further real spatial microphones.
[000161] Now, the combination in blocks 502 and 505 of Figure 19 according to an application is explained in more detail. It is assumed that two or more audio signals from a plurality of different real spatial microphones have been modified to compensate for the different propagation paths, in order to obtain two or more modified audio signals. Once the audio signals from the different real spatial microphones have been modified to compensate for the different propagation paths, they can be combined to improve the audio quality. By doing so, for example, the SNR can be increased or the reverberation can be reduced.
[000162] Possible solutions for the combination comprise:
[000163] Weighted averaging, for example, considering the SNR, or the distance to the virtual microphone, or the diffusion estimated by the real spatial microphones. Traditional solutions, for example, Maximum Ratio Combining (MRC) or Equal Gain Combining (EQC), can be employed, or
[000164] Linear combination of some or all of the modified audio signals to obtain a combination signal.
The modified audio signals can be weighted in the linear combination to obtain the combination signal, or
[000165] Selection, for example, only one signal is used, depending, for example, on the SNR or the distance or the diffusion.
[000166] The task of module 502 is, if applicable, to calculate the parameters for the combination, which is performed in module 505.
[000167] Now, the spectral weighting according to applications is described in more detail. For this, reference is made to blocks 503 and 506 of Figure 19. In this final step, the audio signal resulting from the combination or from the propagation compensation of the input audio signals is weighted in the time-frequency domain according to the spatial characteristics of the virtual spatial microphone, as specified by the input 104, and/or according to the reconstructed geometry (given in 205).
[000168] For each time-frequency position, the geometric reconstruction allows us to easily obtain the DOA relative to the virtual microphone, as shown in Figure 21. Furthermore, the distance between the virtual microphone and the position of the sound event can also be readily calculated.
[000169] The weight for the time-frequency position is then calculated considering the type of virtual microphone desired.
[000170] In the case of directional microphones, the spectral weights can be calculated according to a predefined pick-up pattern. For example, according to an application, a cardioid microphone may have a pick-up pattern defined by the function g(theta), g(theta) = 0.5 + 0.5 cos(theta),
[000171] where theta is the angle between the look direction of the virtual spatial microphone and the DOA of the sound from the point of view of the virtual microphone.
[000172] Another possibility is artistic (non-physical) decay functions. In certain applications, it may be desired to suppress sound events far away from the virtual microphone with a factor greater than the one characterizing free-field propagation. For this purpose, some applications introduce an additional weighting function that depends on the distance between the virtual microphone and the sound event. In an application, only sound events within a certain distance (e.g., in meters) from the virtual microphone should be picked up.
[000173] With respect to the directivity of the virtual microphone, arbitrary directivity patterns can be applied for the virtual microphone. In doing so, one can, for example, separate a source from a complex sound scene.
[000174] Since the DOA of the sound can be computed at the position pv of the virtual microphone, namely
[000175] φv(k, n) = arccos( s(k, n) · cv / ||s(k, n)|| ),
where cv is a unit vector describing the orientation of the virtual microphone, arbitrary directivities for the virtual microphone can be realized. For example, assuming that Pv(k, n) indicates the combination signal or the propagation-compensated modified audio signal, then the formula:
[000176] Pv,cardioid(k, n) = Pv(k, n) · [0.5 + 0.5 cos(φv(k, n))]
calculates the output of a virtual microphone with cardioid directivity. The directional patterns that can potentially be generated in this way depend on the accuracy of the position estimation.
[000177] In applications, one or more real non-spatial microphones, for example, an omnidirectional microphone or a directional microphone such as a cardioid, are placed in the sound scene in addition to the real spatial microphones to further improve the sound quality of the virtual microphone signals 105 in Figure 8. These microphones are not used to gather any geometric information, but rather only to provide a cleaner audio signal.
These microphones can be placed closer to the sound sources than the spatial microphones. In this case, according to an application, the audio signals of the real non-spatial microphones and their positions are simply fed to the propagation compensation module 504 of Fig. 19 for processing, instead of the audio signals of the real spatial microphones. Propagation compensation is then conducted for the one or more audio signals recorded by the non-spatial microphones with respect to the position of the one or more non-spatial microphones. Thereby, an application using additional non-spatial microphones is realized.
[000178] In another application, the computation of the spatial side information of the virtual microphone is realized. To compute the spatial side information 106 of the microphone, the information computing module 202 of Fig. 19 comprises a spatial side information computation module 507, which is adapted to receive as input the positions of the sound sources 205 and the position, orientation and characteristics 104 of the virtual microphone. In certain applications, according to the side information 106 that needs to be computed, the audio signal of the virtual microphone 105 can also be taken into account as input to the spatial side information computation module 507.
[000179] The output of the spatial side information computation module 507 is the side information of the virtual microphone 106. This side information can be, for example, the DOA or the diffusion of the sound for each time-frequency position (k, n) from the point of view of the virtual microphone. Another possible side information could be, for example, the active sound intensity vector Ia(k, n) that would be measured at the position of the virtual microphone. How these parameters can be derived will now be described.
[000180] According to an application, the DOA estimation for the virtual spatial microphone is realized. The information computing module 120 is adapted to estimate the direction of arrival at the virtual microphone as spatial side information, based on a position vector of the virtual microphone and based on a position vector of the sound event, as illustrated by Figure 22.
[000181] Figure 22 depicts a possible way to derive the DOA of the sound from the point of view of the virtual microphone. The position of the sound event, provided by block 205 in Fig. 19, can be described, for each time-frequency position (k, n), with a position vector r(k, n), the position vector of the sound event. Similarly, the position of the virtual microphone, provided as input 104 in Figure 19, can be described with a position vector s(k, n), the position vector of the virtual microphone. The look direction of the virtual microphone can be described by a vector v(k, n). The DOA relative to the virtual microphone is given by a(k, n). It represents the angle between v and the sound propagation path h(k, n). h(k, n) can be calculated using the formula: h(k, n) = s(k, n) - r(k, n).
[000182] The desired DOA a(k, n) can now be calculated, for each (k, n), for example by means of the definition of the inner product of h(k, n) and v(k, n), namely
[000183] a(k, n) = arccos( h(k, n) · v(k, n) / ( ||h(k, n)|| ||v(k, n)|| ) ).
[000184] In another application, the information computing module 120 can be adapted to estimate the active sound intensity at the virtual microphone as spatial side information, based on a position vector of the virtual microphone and based on a position vector of the sound event, as illustrated by Figure 22.
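The side-information computation of paragraphs [000181] to [000183] can be sketched as follows; the vector dimensionality and the function name are assumptions (the same expression works in 2D or 3D), and the formula follows the definitions of r, s and v given above.

```python
import numpy as np

def doa_at_virtual_mic(r, s, v):
    """Angle a(k, n) between the look direction v of the virtual microphone and
    the sound propagation path h(k, n) = s(k, n) - r(k, n)."""
    h = np.asarray(s) - np.asarray(r)
    v = np.asarray(v)
    cos_a = np.dot(h, v) / (np.linalg.norm(h) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_a, -1.0, 1.0))
```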
[000185] From the DOA a(k, n) defined above, we can derive the active sound intensity Ia(k, n) at the position of the virtual microphone. For this, it is assumed that the audio signal 105 of the virtual microphone in Fig. 19 corresponds to the output of an omnidirectional microphone, i.e., we assume that the virtual microphone is an omnidirectional microphone. Furthermore, the look direction v in Figure 22 is assumed to be parallel to the x-axis of the coordinate system. Since the desired active sound intensity vector Ia(k, n) describes the net flow of energy through the position of the virtual microphone, Ia(k, n) can be calculated, for example, according to the formula: [000186] [000187] where [ ]T denotes a transposed vector, rho is the air density, and Pv(k, n) is the sound pressure measured by the virtual spatial microphone, e.g., the output 105 of block 506 in Figure 19. [000188] If the active intensity vector is to be calculated expressed in the general coordinate system, but still at the position of the virtual microphone, the following formula can be applied: [000189] [000190] The diffusion of the sound expresses how diffuse the sound field is in a given time-frequency compartment (see, for example, [2]). Diffusion is expressed by a value ψ, where 0 ≤ ψ ≤ 1. A diffusion of 1 indicates that the total sound energy of the sound field is completely diffuse. This information is important, for example, for the reproduction of spatial sound. Traditionally, diffusion is computed at the specific point in space where a microphone array is placed. [000191] According to an application, the diffusion can be computed as an additional parameter of the lateral information generated for the virtual microphone (VM), which can be placed at an arbitrary position in the sound scene. Because of this, an apparatus that also computes the diffusion, in addition to the audio signal at the virtual position of a virtual microphone, can be seen as a virtual DirAC front-end, as it makes it possible to produce a DirAC stream, namely an audio signal, a direction of arrival and a diffusion, for an arbitrary point in the sound scene. The DirAC stream can be further processed, stored, transmitted and played back on an arbitrary multi-loudspeaker configuration. In this case, the listener experiences the sound scene as if he or she were at the position specified by the virtual microphone and were looking in the direction determined by its orientation. [000192] Figure 23 illustrates an information computation block according to an application comprising a diffusion computation unit 801 for computing the diffusion at the virtual microphone. The information computation block 202 is adapted to receive inputs 111 to 11N, which, in addition to the inputs of Fig. 14, also include the diffusion at the real spatial microphones. Let Φ(SM1) to Φ(SMN) denote these values. These additional inputs are fed to the information computation module 202. The output 103 of the diffusion computation unit 801 is the diffusion parameter computed at the position of the virtual microphone. [000193] A diffusion computation unit 801 of an application is illustrated in more detail in Figure 24. According to one application, the direct and diffuse sound energy at each of the N spatial microphones is estimated. Then, using the information on the positions of the IPLS and the information on the positions of the spatial and virtual microphones, N estimates of these energies at the position of the virtual microphone are obtained.
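For illustration only, this estimation at the spatial microphones, together with the combination step described in the following paragraphs, can be sketched as below; the energy split E_dir = (1 - ψ)|P|², E_diff = ψ|P|² and the 1/r² decay of the direct sound are assumptions standing in for the formulas that are given as figures:

```python
import numpy as np

def estimate_vm_diffusion(P, psi, d_sm, d_vm):
    """Sketch: estimate the diffusion at the virtual microphone (VM) for one
    time-frequency bin from N spatial-microphone observations.

    P     : complex pressures P_1..P_N at the N spatial microphones
    psi   : diffusion values psi_1..psi_N estimated at the spatial microphones
    d_sm  : distances from the IPLS to each spatial microphone
    d_vm  : distance from the IPLS to the virtual microphone

    Assumed (hypothetical) conventions: E_dir = (1 - psi)|P|^2, E_diff = psi |P|^2,
    and a 1/r^2 decay of the direct sound energy.
    """
    P, psi = np.asarray(P), np.asarray(psi)
    d_sm = np.asarray(d_sm, dtype=float)

    energy = np.abs(P) ** 2
    e_dir_sm = (1.0 - psi) * energy          # direct energy at each spatial microphone
    e_diff_sm = psi * energy                 # diffuse energy at each spatial microphone

    # Diffuse energy is assumed equal everywhere: simple average (combination unit 820).
    e_diff_vm = np.mean(e_diff_sm)

    # Direct energy is adjusted for the different propagation distances (unit 830).
    e_dir_vm_candidates = e_dir_sm * (d_sm / d_vm) ** 2

    # Combine the N direct-energy estimates, here by averaging (unit 840).
    e_dir_vm = np.mean(e_dir_vm_candidates)

    # Diffusion at the virtual microphone (sub-calculator 850).
    return e_diff_vm / (e_diff_vm + e_dir_vm + 1e-12)
```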
Finally, the estimates can be combined to improve the accuracy of the estimate and the virtual microphone diffusion parameter can be readily calculated. [000194] Leave denote the estimates of the diffuse and direct sound energies for the N spatial microphones calculated by the energy analysis unit 810. If Pi is the complex pressure signal and Φi is the diffusion for the i-th spatial microphone, then the energies can, for example, be calculated according to the formula: [000195] The diffused sound energy must be equal in all positions, thus an estimate of the Ediff diffused sound energy in the virtual microphone can be calculated simply by averaging , for example, in a diffusion combination unit 820, for example, according to the formula: [000196] A more effective combination of estimates The could be performed considering the variance of the evaluators, for example, considering the SNR. [000197] The energy of direct sound depends on the distance to the source due to propagation. Thus, The can be modified to account for this. This can be accomplished, for example, by the direct sound propagation adjustment unit 830. For example, if the energy of the direct sound field is assumed to decline by 1 over the squared distance, then the estimate for the sound direct from the virtual microphone to the i-th space microphone can be calculated according to the formula: [000198] Similar to the diffusion combining unit 820, the direct sound energy estimates obtained in different spatial microphones can be combined, for example, by the direct sound combining unit 840. The result is EjArMl , eg the estimate for direct sound energy into the virtual microphone. The diffusion in the virtual microphone can be calculated, for example, by the diffusion subcalculator 850, for example, according to the formula: [000199] As mentioned above, in some cases, the estimation of the position of the sound events performed by the evaluator of the position of the sound events fails, for example, in case of a wrong estimation of the direction of arrival. Figure 25 illustrates such a scenario. In these cases, regardless of the estimated diffusion parameters in the different spatial microphones and as received as inputs 111 to 11N, the diffusion for virtual microphone 103 can be set to 1 (ie, completely diffuse), as no spatially coherent reproduction is possible. [000200] Additionally, the reliability of the DOA estimates in the N spatial microphones can be considered. This can be expressed, for example, in terms of the variance of the DOA or SNR evaluator. Such information can be considered by the 850 diffusion subcalculator, so that the VM 103 diffusion can be artificially increased in the case that the DOA estimates are unreliable. In fact, as a consequence, the estimates for position 205 will still be unreliable. [000201] Figure 2a illustrates an apparatus 150 for generating at least one audio output signal based on an audio data stream comprising audio data referring to one or more sound sources according to an application. [000202] The apparatus 150 comprises a receiver 160 for receiving the audio data stream comprising the audio data. Audio data comprises one or more pressure values for each of one or more sound sources. Furthermore, the audio data comprises one or more position values indicating a position of one of the sound sources for each of the sound sources. 
Further, the apparatus comprises a synthesis module 170 for generating at least one audio output signal based on at least one of the one or more pressure values of the audio data of the audio data stream and based on at least one of the one or more position values of the audio data of the audio data stream. The audio data are defined for one time-frequency position of a plurality of time-frequency positions. For each of the sound sources, at least one pressure value is comprised in the audio data, wherein the at least one pressure value may be a pressure value referring to an emitted sound wave, e.g., originating from the sound source. The pressure value can be a value of an audio signal, for example a pressure value of an audio output signal generated by an apparatus for generating an audio output signal of a virtual microphone, wherein this virtual microphone is placed at the position of the sound source. [000203] Thus, figure 2a illustrates an apparatus 150 that can be employed to receive or process the mentioned audio data stream, that is, the apparatus 150 can be employed on the receiver/synthesis side. The audio data stream comprises audio data which comprise one or more pressure values and one or more position values for each of a plurality of sound sources, i.e., each of the pressure values and position values refers to a particular sound source of the one or more sound sources of the recorded audio scene. This means that the position values indicate positions of sound sources rather than of the recording microphones. Regarding the pressure values, this means that the audio data stream comprises one or more pressure values for each of the sound sources, that is, the pressure values indicate an audio signal that is related to a sound source rather than to a recording of a real spatial microphone. [000204] According to an application, the receiver 160 can be adapted to receive the audio data stream comprising the audio data, wherein the audio data further comprise one or more diffusion values for each of the sound sources. The synthesis module 170 can be adapted to generate the at least one audio output signal based on at least one of the one or more diffusion values. [000205] Figure 2b illustrates an apparatus 200 for generating an audio data stream comprising sound source data referring to one or more sound sources according to an application. The apparatus 200 for generating an audio data stream comprises a determiner 210 for determining the sound source data based on at least one audio input signal recorded by at least one spatial microphone and based on audio lateral information provided by at least two spatial microphones. Further, the apparatus 200 comprises a data stream generator 220 for generating the audio data stream so that the audio data stream comprises the sound source data. The sound source data comprise one or more pressure values for each of the sound sources. In addition, the sound source data further comprise one or more position values indicating a sound source position for each of the sound sources. Further, the sound source data are defined for one time-frequency position of a plurality of time-frequency positions. [000206] The audio data stream generated by the apparatus 200 can then be transmitted. Thus, the apparatus 200 can be used on the analysis/transmitter side. The audio data stream comprises audio data which comprise one or more pressure values and one or more position values for each of a plurality of sound sources, i.e.,
each of the pressure values and position values refers to a particular sound source of the one or more sound sources of the recorded audio scene. This means that, with respect to the position values, the position values indicate positions of sound sources rather than of the recording microphones. [000207] In another application, the determiner 210 can be adapted to determine the sound source data based on diffusion information provided by at least one spatial microphone. The data stream generator 220 may be adapted to generate the audio data stream so that the audio data stream comprises the sound source data. In this case, the sound source data further comprise one or more diffusion values for each of the sound sources. [000208] Figure 3a illustrates an audio data stream according to an application. The audio data stream comprises audio data referring to two sound sources that are active in one time-frequency position. In particular, Figure 3a illustrates the audio data that are transmitted for a time-frequency position (k, n), where k denotes the frequency index and n denotes the time index. The audio data comprise a pressure value P1, a position value Q1 and a diffusion value ψ1 of a first sound source. The position value Q1 comprises three coordinate values X1, Y1 and Z1 indicating the position of the first sound source. Furthermore, the audio data comprise a pressure value P2, a position value Q2 and a diffusion value ψ2 of a second sound source. The position value Q2 comprises three coordinate values X2, Y2 and Z2 indicating the position of the second sound source. [000209] Figure 3b illustrates an audio data stream according to another application. Again, the audio data comprise a pressure value P1, a position value Q1 and a diffusion value ψ1 of a first sound source. The position value Q1 comprises three coordinate values X1, Y1 and Z1 indicating the position of the first sound source. Furthermore, the audio data comprise a pressure value P2, a position value Q2 and a diffusion value ψ2 of a second sound source. The position value Q2 comprises three coordinate values X2, Y2 and Z2 indicating the position of the second sound source. [000210] Figure 3c provides another illustration of the audio data stream. As the audio data stream provides geometry-based spatial audio coding (GAC) information, it is also referred to as "geometry-based spatial audio coding stream" or "GAC stream". The audio data stream comprises information that refers to one or more sound sources, for example one or more isotropic point-like sources (IPLS). As explained above, the GAC stream can comprise the following signals, where k and n denote the frequency index and the time index of the considered time-frequency position: • P(k, n): Complex pressure at the sound source, e.g., at the IPLS. This signal possibly comprises direct sound (the sound originating from the IPLS itself) and diffuse sound. • Q(k, n): Position (e.g., Cartesian coordinates in 3D) of the sound source, e.g., of the IPLS; the position can, for example, comprise the Cartesian coordinates X(k, n), Y(k, n), Z(k, n). • Diffusion of the IPLS: ψ(k, n). This parameter is related to the power ratio of the direct to the diffuse sound comprised in P(k, n). It is only one possibility to express the diffusion; other equivalent representations are conceivable, for example the Direct to Diffuse Ratio (DDR). [000211] As already mentioned, k and n denote the indices of frequency and time, respectively.
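As an illustration only, one layer of such a stream for a single time-frequency position could be represented as follows; the class and field names are hypothetical and are not part of the described format:

```python
from dataclasses import dataclass

@dataclass
class GacLayerEntry:
    """One layer of a GAC stream for a single time-frequency position (k, n).

    Illustrative field names only; the stream format itself is described in the text.
    """
    pressure: complex   # P(k, n): complex pressure at the sound source (IPLS)
    position: tuple     # Q(k, n) = (X, Y, Z): Cartesian position of the source
    diffusion: float    # psi(k, n): diffusion, 0 <= psi <= 1

# Example: the two-source situation of Figure 3a for one (k, n) bin,
# with made-up numbers.
layer_1 = GacLayerEntry(pressure=0.8 + 0.2j, position=(1.0, 2.0, 0.0), diffusion=0.1)
layer_2 = GacLayerEntry(pressure=0.3 - 0.1j, position=(-0.5, 1.5, 0.0), diffusion=0.6)
frame = [layer_1, layer_2]   # a multi-layer GAC "frame" for this (k, n)
```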
If desired and if the analysis allows it, more than one IPLS can be represented in a given time-frequency slot. This is described in Figure 3c as M multiple layers, so that the pressure signal for the i-th layer (ie for i-th IPLS) is denoted with Pi(k, n). For convenience, the position of the IPLS can be expressed as the vector . Unlike the prior art, all parameters in the GAC stream are expressed with respect to one or more sound sources, for example, with respect to IPLS, thus achieving independence of the recording position. In figure 3c, as well as in figure 3a and 3b, all quantities in the figure are considered in the time-frequency domain; the (k,n) notation has been ignored for simplicity reasons, eg Pi means Pi(k,n), eg [000212] In the following, an apparatus for generating an audio data stream according to an application is explained in more detail. Like the apparatus of Fig. 2b, the apparatus of Fig. 4 comprises a determinant 210 and a data stream generator 220 which may be similar to the determinant 210. As the determinant analyzes the input audio data to determine the sound source data with On the basis of which the data stream generator generates the audio data stream, the determinant and the data stream generator can together be referred to as an "analysis module" (see analysis module 410 in figure 4). [000213] The analysis module 410 calculates the GAC stream of recordings from the N space microphones. Depending on the number M of desired layers (eg the number of sound sources for which information must be understood in the audio data stream for a particular time-frequency position), the type and number N of spatial microphones, different methods for analysis are conceivable. Some examples are given below. [000214] As a first example, the parameter estimate for a sound source, eg an IPLS, by time-frequency compartment is considered. In the case of M = 1, the GAC flow can be readily obtained with the concepts explained above for the device to generate an audio output signal from a virtual microphone, in which a virtual space microphone can be placed in the position of the sound source, by example, in the position of IPLS. This allows pressure signals to be calculated at the IPLS position, along with corresponding position estimates, and possibly the diffusion. These three parameters are grouped together in a GAC stream and can be further manipulated by module 102 in Figure 8 before being transmitted or stored. [000215] For example, the determiner can determine the position of a sound source using the concepts proposed for estimating the position of the sound events of the device to generate an audio output signal from a virtual microphone. Further, the determiner may comprise an apparatus for generating an audio output signal and may use the position of the sound source determined as the position of the virtual microphone to calculate pressure values (e.g., audio output signal values at be generated) and the diffusion at the position of the sound source. [000216] In particular, the determiner 210, for example in figure 4), is configured to determine the pressure signals, the corresponding position estimates, and the corresponding diffusion, while the data flow generator 220 is configured to generate the audio data stream based on calculated pressure signals, position estimates and diffusion. [000217] As another example, the parameter estimate for 2 sound sources, eg 2 IPLS, per time-frequency compartment is considered. 
If the analysis module 410 is to estimate two sound sources per time-frequency position, then the following concept based on prior art evaluators can be used. [000218] Figure 5 illustrates a sound scene composed of two sound sources and two uniform linear microphone arrays. Reference is made to ESPRIT, see [000219] [26] R. Roy and T. Kailath. ESPRIT-estimation of signal parameters via rotational invariance techniques. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(7):984-995, July 1989. [000220] ESPRIT ([26]) can be used separately in each matrix to obtain two DOA estimates for each time-frequency position in each matrix. Due to an ambiguity of pairing, this leads to two possible solutions for the position of sources. As can be seen from Figure 5, the two possible solutions are given by (1, 2) and (1' , 2'). To resolve this ambiguity, the following solution can be applied. The signal emitted from each source is estimated using a beamformer oriented towards the positions of the estimated source and applying a correct factor to compensate for propagation (eg, multiplying by the inverse of the attenuation presented by the wave). This can be done for each source in each matrix for each of the possible solutions. We can then define an estimation error for each pair of sources (i, j) as: [000221] where (i, j) £ {(1, 2), (1', 2')} (see figure 5) and Pi#1 is responsible for the energy of the compensated signal seen by the r matrix of the sound source i . The error is minimal for the true sound source pair. Once the matching issue is resolved and the correct DOA estimates are calculated, these are grouped together, along with the estimates of the pressure and diffusion signals corresponding to a GAC flow. The estimates of pressure and diffusion signals can be obtained using the same method already described for parameter estimation for a sound source. [000222] Figure 6a illustrates an apparatus 600 for generating at least one audio output signal based on an audio data stream according to an application. Apparatus 600 comprises a receiver 610 and a synthesis module 620. Receiver 610 comprises a modification module 630 for modifying the audio data of the received audio data stream by modifying at least one of the pressure values of the audio data , at least one of the position values of the audio data or at least one of the diffusion values of the audio data relating to at least one of the sound sources. [000223] Figure 6b illustrates an apparatus 660 for generating an audio data stream comprising sound source data referring to one or more sound sources according to an application. The apparatus for generating an audio data stream comprises a determiner 670, a data stream generator 680 and another modifying module 690 for modifying the audio data stream generated by the data stream generator by modifying at least one of the values of pressure of the audio data, at least one of the position values of the audio data or at least one of the diffusion values of the audio data relating to at least one of the sound sources. [000224] While the modification module 610 of figure 6a is employed on a receiver/synthesis side, the modification module 660 of figure 6b is employed on a transmitter/analysis side. [000225] Modifications of the audio data stream driven by the modification module 610, 660 can still be considered as sound scene modifications. Thus, the modification module 610, 660 may still be referred to as the sound scene manipulation modules. 
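Returning for a moment to the two-source analysis of paragraphs [000218] to [000221], the resolution of the pairing ambiguity can be sketched as follows; the concrete error formula is given as a figure above, so the version below merely assumes that it penalizes a mismatch of the propagation-compensated energies between the two arrays:

```python
def pairing_error(energies, pair):
    """Hypothetical estimation error for a candidate source pairing.

    energies[(src, array)] holds the energy of the propagation-compensated,
    beamformed signal of candidate source `src` as seen by `array` (1 or 2).
    The exact error formula of paragraphs [000220]/[000221] is given as a
    figure; here it is simply assumed to penalize energy mismatch between
    the two arrays.
    """
    return sum(abs(energies[(src, 1)] - energies[(src, 2)]) for src in pair)

def resolve_pairing(energies):
    """Pick the candidate pairing (1, 2) vs (1', 2') with the smaller error."""
    candidates = [("1", "2"), ("1'", "2'")]
    return min(candidates, key=lambda pair: pairing_error(energies, pair))

# Example with made-up energies: the true sources are (1, 2), so their
# compensated energies agree across the two arrays, while the ghost
# solutions (1', 2') do not.
energies = {("1", 1): 1.00, ("1", 2): 0.98,
            ("2", 1): 0.40, ("2", 2): 0.43,
            ("1'", 1): 1.00, ("1'", 2): 0.45,
            ("2'", 1): 0.40, ("2'", 2): 0.95}
print(resolve_pairing(energies))   # -> ('1', '2')
```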
[000226] The representation of the sound field provided by the GAC stream allows different types of modifications of the audio data stream, that is, as a consequence, manipulations of the sound scene. Some examples in this context are: 1. Expanding arbitrary sections of space/volumes in the sound scene (eg, expanding a point-type sound source to make it appear wider to the listener); 2. Transform a selected section of space/volume to any other arbitrary section of space/volume in the sound scene (the transformed space/volume could, for example, contain a font that is required to be moved to a new location); 3. Filter based on position, where selected regions of the sound scene are enhanced or partially/completely suppressed [000227] Next a layer of an audio data stream, for example a GAC stream, is assumed to comprise all audio data from one of the sound sources with respect to the particular time-frequency position. [000228] Figure 7 describes a modification module according to an application. The modification unit of Fig. 7 comprises a demultiplexer 401, a manipulation processor 420 and a multiplexer 405. [000229] The demultiplexer 401 is configured to separate the different layers of the M layer GAC stream and form the single M layer GAC's streams. Furthermore, the manipulation processor 420 comprises units 402, 403 and 404, which are applied in each one. of the GAC flows separately. In addition, the multiplexer 405 is configured to form the M-layer GAC stream from the manipulated single-layer GAC's streams. [000230] Based on the position data of the GAC stream and knowledge of the position of real sources (eg transmitters), energy can be associated with a given real source for each time-frequency position. The P pressure values are then correctly weighted to modify the noise from the respective real source (eg loudspeaker) . This requires advance information or an estimate of the location of real sound sources (eg transmitters). [000231] In some applications, if knowledge about the position of the real sources is available, then based on the position data from the GAC flow, energy can be associated with a given real source for each time-frequency position. [000232] The manipulation of the audio data stream, for example, the GAC stream can occur in the modification module 630 of the apparatus 600 to generate at least one audio output signal of figure 6a, i.e., on one side of the receiver/synthesis and/or in the modification module 690 of the apparatus 660 to generate an audio data stream of Fig. 6b, i.e. on one side of the transmitter/analysis. [000233] For example, the audio data stream, ie the GAC stream, can be modified before transmission, or before synthesis after transmission. [000234] Unlike the modification module 630 of figure 6a on the receiver/synthesis side, the modification module 690 of figure 6b on the transmitter/analysis side can explain the additional information of inputs 111 to 11N (the recorded signals) and 121 s 12N (relative position and orientation of space microphones), as this information is available on the transmitter side. Using this information, a modification unit according to an alternative application can be realized, which is described in figure 8. 
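The types of manipulation listed in paragraph [000226] are detailed further below; purely as an illustration, their effect on the demultiplexed single-layer streams of Fig. 7 can be sketched as follows, where all function and parameter names are illustrative and not part of the apparatus:

```python
import numpy as np

def manipulate_gac_frame(layers, region, mode, jitter=0.1, shift=(1.0, 0.0, 0.0), gain=0.0):
    """Sketch of the per-layer manipulation of Fig. 7 for one (k, n) bin.

    `layers` is a list of dicts with keys "P" (pressure), "Q" (position as a
    3-vector) and "psi" (diffusion), i.e., the demultiplexed single-layer
    streams (unit 401). `region(Q)` decides whether a source lies inside the
    selected section of space/volume V (decision units 403).
    """
    out = []
    for layer in layers:                      # manipulation processor 420, per layer
        P, Q, psi = layer["P"], np.asarray(layer["Q"], float), layer["psi"]
        if region(Q):
            if mode == "expand":              # 1. volume expansion: jitter the position
                Q = Q + np.random.uniform(-jitter, jitter, size=3)
            elif mode == "translate":         # 2. volume transformation: deterministic mapping f(Q)
                Q = Q + np.asarray(shift, float)
            elif mode == "filter":            # 3. position-based filtering: weight the pressure
                P = gain * P
        out.append({"P": P, "Q": Q, "psi": psi})
    return out                                # re-multiplexed into an M-layer stream (unit 405)

# Example: attenuate everything located inside a 1 m sphere around the origin.
frame = [{"P": 0.5 + 0.1j, "Q": (0.2, 0.0, 0.1), "psi": 0.2},
         {"P": 0.9 - 0.3j, "Q": (3.0, 1.0, 0.0), "psi": 0.5}]
inside_sphere = lambda Q: np.linalg.norm(Q) < 1.0
filtered = manipulate_gac_frame(frame, inside_sphere, mode="filter", gain=0.1)
```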
[000235] Figure 9 depicts an application illustrating a schematic overview of a system, in which a GAC stream is generated on one side of the transmitter/analysis, where optionally GAC stream can be modified by the modification module 102 on one side of the transmitter/analysis, where the GAC stream can optionally be modified on one side of the receiver/synthesis by the modification module 103 and where the GAC stream is used to generate a plurality of audio output signals 191...19L. [000236] On the transmitter/analysis side, the representation of the sound field (for example, the GAC flow) is calculated in unit 101 of inputs 111 to 11N, that is, the signals recorded with N > 2 spatial microphones, and of the inputs 121 to 12N, that is, positive position and orientation of the space microphones. [000237] The output of unit 101 is the representation of the previously mentioned sound field, which in the following is denoted as spatial audio coding stream (GAC) based on geometry. Similar to the proposal in [000238] [20] Giovanni Del Galdo, Oliver Thiergart, Tobias Weller, and E.A.P. Habets. Generating virtual microphone signals using geometrical information gathered by distributed arrays. In Third Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA 'll), Edinburgh, United Kingdom, May 2011, [000239] and as described for the apparatus to generate an audio output signal from a virtual microphone in a configurable virtual position, a complex sound scene is modeled by means of sound sources, for example, sound isotropic point-type sources ( IPLS), which are active in specific compartments in a time-frequency representation, such as one provided by the Short Duration Fourier Transform (STFT). [000240] The GAC stream may be further processed in optional modification module 102, which may further be referred to as a handling unit. Modification module 102 allows for various applications. The GAC stream can then be streamed or stored. The parametric nature of the GAC flow is highly efficient. On the synthesis/receiver side, a more optional modification module (handling units) 103 can be employed. The resulting GAC stream enters the synthesis unit 104 which generates the speaker signals. Given the independence of the recording representation, the end user on the playback side can potentially manipulate the sound scene and decide the listening position and orientation within the sound scene freely. [000241] The modification/manipulation of the audio data stream, for example, the GAC stream can occur in modification module 102 and/or 103 in figure 9, modifying the GAC stream correctly both before transmission in module 102 and after the transmission before synthesis 103. Unlike the modification module 103 on the receiver/synthesis side, the modification module 102 on the transmitter/analysis side can explain the additional information of inputs 111 to 11N (the audio data provided by the space microphones) and 121 to 12N (Positive Position and Orientation of Space Microphones) as this information is available on the transmitter side. Figure 8 illustrates an alternative application of a modification module that employs this information. [000242] Examples of different concepts for handling the GAC flow are described below with reference to figure 7 and figure 8. Units with equal reference signs have equal function. 1. volume expansion [000243] It is assumed that the given energy in the scene is located within volume V. 
The volume V can indicate a predefined area of an environment, θ denotes the set of time-frequency (k, n) positions in which the sources corresponding voices, eg IPLS, are located within volume V. [000244] If expansion of volume V into another volume V' is desired, this can be achieved by adding a random term to the position data in the GAC stream whenever k, n) G β (evaluated in decision units 403) and replacing Q(k,n) = [X(k,n), Y(k,n) ,Z(k,n)]T (the index layer is reduced for simplicity) so that the outputs 431 to 43M of the units 404 in figure 7 and 8 become [000245] [000246] where they are random variables whose range depends on the geometry of the new volume V' with respect to the original volume V. This concept can, for example, be employed to become a wider sound source. In this example, the original volume V is infinitely small, ie the sound source, eg IPLS, must be located at the same point Q(k, n) = [X(k, n) , Y(k, n) ) , Z(k, n)]T for all (k, n) G θ. This mechanism can be seen as a way of hesitating the position parameter Q(k, n). [000247] According to an application, each of the position values of each of the sound sources comprises at least two coordinate values, and the modification module is adapted to modify the coordinate values by adding at least one random number to the values coordinate values, when coordinate values indicate that a sound source is located at a position within a predefined area of a room. 2. Volume transformation [000248] In addition to volume expansion, the GAC stream position data can be modified to relocate the space/volume sections within the sound field. In this case, the data to be manipulated comprise the spatial coordinates of localized energy. [000249] V again denotes the volume that is to be relocated, and ® denotes the set of all time-frequency positions (k, n) in which energy is located within volume V. Again, volume V may indicate a predefine area of an environment. [000250] Volume reallocation can be achieved by modifying the GAC flow so that for all time-frequency positions (k,n) 6 0, Q(k,n) are reallocated by f(Q(k, n)) at outputs 431 to 43M of 404 units, where f is a function of the spatial coordinates (X, Y, Z), describing the volume manipulation to be performed. The function f can represent a simple linear transformation such as rotation, translation, or any other complex non-linear mapping. This technique can be used, for example, to move sound sources from one position to another within the sound scene ensuring that © corresponds to the set of time-frequency positions in which the sound sources were located within volume V. The technique allows for a variety of complex manipulations of the entire sound scene, such as scene reflection, scene rotation, scene magnification and/or compression etc. For example, by applying an appropriate linear mapping on volume V, the complementary effect of volume expansion, ie volume shrinkage, can be obtained. This could, for example, be done by mapping Q(k,n) for (k,n) E © af(Q(k,n)) G V', where V' c V and V' comprise a volume significantly smaller than V. [000251] According to an application, the modification module is adapted to modify the coordinate values by applying a deterministic function to the coordinate values, when the coordinate values indicate that a sound source is located at a position within a predefined area. of an environment. 3. 
Filtering based on position [000252] The idea of filtering based on geometry (or filtering based on position) offers a method to improve or completely/partially remove sections of space/volumes from the sound scene. Compared to volume expansion and transformation techniques in this case, however, only the GAC flow pressure data is modified by applying appropriate scale weights. [000253] In filtering based on geometry, a distinction can be made between the modify module on the transmitter side 102 and on the receiver side 103, in which the provision of one can use inputs 111 to 11N and 121 to 12N to assist in the calculation of appropriate filter weights, as described in figure 8. Assuming the objective is to supply/improve energy originating from a selected section of space/volume V, filtering based on geometry can be applied as follows: [000254] For all (k, n) 6 ©, the complex pressure P(k, n) in the GAC flow is changed to r|P(k, n) at the outputs of 402, where q is a real weighting factor, for example, calculated by unit 402. In some applications, module 402 can be adapted to calculate a diffusion-dependent weighting factor as well. [000255] The concept of filtering based on geometry can be used in a variety of applications, such as signal enhancement and source separation. Some of the applications and the necessary background information include: • De-reverberation. Knowing the geometry of the room, the spatial filter can be used to supply energy located outside the corners of the room that can be caused by multipass propagation. This application can be of interest, for example, for hands-free communication in meeting rooms and cars. Note that to make up for lagging reverberation, it is sufficient to close the filter in the case of high diffusion, where to make up for early reflections a position-dependent filter is more effective. In this case, as already mentioned, the geometry of the room needs to be known in advance. • Suppression of background noise. A similar concept can be used to supply background noise as well. If the potential regions where the sources may be located, (for example, participants' chairs in meeting rooms or car seats) are known, then energy located outside these regions is associated with background noise and is thus suppressed by spatial filter. This application requires prior information or an estimate, based on data available in the GAC flows, of the approximate location of the sources. • Suppression of a point-type interventionist. If the interventionist is clearly located in space, rather than diffusion, position-based filtering can be applied to attenuate the energy located in the interventionist's position. This requires advance information or an estimate of the interventionist's location. • Echo control. In this case the interventionists to be supplied are the loudspeaker signals. For this purpose, similarly to the point-type interventionist case, energy located exactly or close to the position of the speakers is supplied. This requires advance information or an estimate of speaker positions. • Improved voice detection. The signal enhancement techniques associated with the invention of geometry-based filtering can be implemented as a processing step in a conventional voice activity system, for example, in cars. De-reverberation, or noise suppression can be used as supplements to improve system performance. • Surveillance. Preserving only the energy of certain areas and supplying the rest is a technique commonly used in surveillance applications. 
This requires prior information about the geometry and location of the area of interest. • Separation from source. In an environment with multiple sources simultaneously active the spatial filter based on geometry can be applied for source separation. Placing a correctly designed spatial filter centered on the location of a source results in the suppression/attenuation of other sources that are simultaneously active. This innovation can be used, for example, as in SAOC. Prior information or an estimate of source locations is required. • Position Dependent Automatic Gain Control (AGC). Position-dependent weights can be used, for example, to equalize the noise of different transmitters in teleconferencing applications. [000256] Next, the synthesis modules according to the applications are described. According to an application, a synthesis module can be adapted to generate at least one audio output signal based on at least one audio data pressure value of an audio data stream and based on at least one. minus one audio data position value of the audio data stream. At least one pressure value can be a pressure value of a pressure signal, for example an audio signal. [000257] The principles of operation beyond GAC synthesis are motivated by the assumptions of spatial sound perception given in [000258] [27] W02004077884: Tapio Lokki, Juha Merimaa, and Ville Pulkki. Method for reproducing natural or modified spatial impression in multichannel listening, 2006. [000259] In particular, the spatial signals necessary to correctly perceive the spatial image of a sound scene can be obtained by correctly reproducing a non-diffuse sound arrival direction for each time-frequency position. The synthesis, described in figure 10a, is thus divided into two stages. [000260] The first stage considers the position and orientation of the listener within the sound scene and determines which of M IPLS is dominant for each time-frequency position. Consequently, its pressure signal Pdlr and arrival direction θ can be calculated. The remaining and diffuse sound sources are collected in a second pressure signal Pdiff. [000261] The second stage is identical to the second half of the DirAC synthesis described in [27]. Undiffused sound is reproduced with a sound position mechanism that produces a dot-type source, where the diffused sound is reproduced from all speakers after it has been decorrelated. [000262] Figure 10a describes a synthesis module according to an application illustrating the synthesis of the GAC flow. [000263] The unit stage of the 501 synthesis of the first stage calculates the pressure signals Pdir and Pdiff that need to be reproduced differently. In fact, while Pdir understands the sound that must be reproduced coherently in space, Pdiff understands diffuse sound. The third output of the first stage synthesis unit 501 is Direction of Arrival (DOA) θ 505 from the viewpoint of the desired listening position, that is, a direction of arrival information. Note that Direction of Arrival (DOA) can be expressed as an azimuth angle if 2D space, or by the pair of azimuth and elevation angle in 3D. Equivalently, a unit norm vector indicated in a DOA can be used. DOA specifies which direction (with respect to the desired listening position) the Pdir signal should come from. The first stage synthesis unit 501 takes the GAC stream as an input, that is, a parametric representation of the sound field, and calculates the aforementioned signals based on the position of the listener orientation specified by input 141. 
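A compact sketch of what this first synthesis stage computes for one time-frequency bin is given below; the per-block details follow in paragraphs [000265] to [000269], and the direct/diffuse split and the magnitude compensation used here are assumptions rather than the exact formulas of the text:

```python
import numpy as np

def first_stage_synthesis(layers, listener_pos, listener_look):
    """Sketch of the first synthesis stage (unit 501) for one (k, n) bin.

    layers : list of (P, Q, psi) tuples, one per GAC layer (pressure, source
             position, diffusion).
    Returns (P_dir, P_diff, doa): the coherently reproduced sound, the diffuse
    sound, and the direction of arrival of P_dir seen from the listener.

    Assumed conventions (not taken literally from the text):
    P_dir_i = sqrt(1 - psi_i) * P_i, P_diff_i = sqrt(psi_i) * P_i, and a 1/r
    magnitude compensation of the direct part toward the listener position.
    """
    listener_pos = np.asarray(listener_pos, float)
    p_dir, p_diff = [], []
    for P, Q, psi in layers:
        d = np.linalg.norm(np.asarray(Q, float) - listener_pos)
        p_dir.append(np.sqrt(1.0 - psi) * P / max(d, 1e-3))   # propagation compensation (block 602)
        p_diff.append(np.sqrt(psi) * P)

    i_max = int(np.argmax(np.abs(p_dir)))                      # strongest compensated source (block 603)
    to_src = np.asarray(layers[i_max][1], float) - listener_pos
    look = np.asarray(listener_look, float)
    cos_a = np.dot(to_src, look) / (np.linalg.norm(to_src) * np.linalg.norm(look))
    doa = np.arccos(np.clip(cos_a, -1.0, 1.0))                 # DOA relative to the look direction (block 607)

    # Everything except the dominant direct part is treated as diffuse (output 504).
    P_dir = p_dir[i_max]
    P_diff = sum(p_diff) + sum(p for j, p in enumerate(p_dir) if j != i_max)
    return P_dir, P_diff, doa
```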
In fact, the user end can freely decide the listening position and orientation within the sound scene described by the GAC stream. [000264] Second stage synthesis unit 502 calculates the L speaker signals 511 to 51L based on knowledge of speaker 131 configuration. Please remember that unit 502 is identical to the second half of DirAC synthesis described in [27]. [000265] Figure 10b describes a first unit of the synthesis stage according to an application. The input provided to the block is a GAC stream composed of M layers. In a first step, the unit 601 demultiplexes the M layers into a parallel GAC stream of one layer. [000266] The i-th GAC flow comprises a Pi pressure signal, a diffusion and a position vector Qi = [Xj., Yif Zi]T. The pressure signal Pi comprises one or more pressure values. The position vector is a position value. At least one audio output signal is now generated based on these values. [000267] The pressure signal for diffuse and direct sound Pdir,ie Pdiff,i, is obtained from P by applying a correct factor derived from diffusion . The pressure signals comprise direct sound that enters propagation compensation block 602 which calculates delays corresponding to the propagation of the signal from the sound source position, e.g., the position of the IPLS, to the position of the listener. In addition, the block even calculates the gain factors needed to compensate for different magnitude declines. In other applications, only different magnitude declines are compensated for, while delays are not compensated for. [000268] Compensated pressure signals, denoted by enter block 603, which outputs the iraax index of the strongest input [000269] The main idea behind this mechanism is that the M IPLS active in the time-frequency position in the study, only the strongest one (with respect to the listener's position) will be reproduced coherently (ie as direct sound). Blocks 604 and 605 select from their inputs one that is defined by . Block 607 calculates the arrival direction of IPLS with respect to listener position and orientation (input 141). The exit from block 604 corresponds to the output of block 501, namely the sound signal Pdir which will be reproduced as direct sound by block 502. The diffuse sound, namely output 504 Pdief, comprises the sum of all the diffuse sound in the M branches as well as all the signals of direct sound except for to know [000270] Figure 10c illustrates a second unit of the synthesis stage 502. As already mentioned, this stage is identical to the second half of the synthesis module proposed in [27] . The non-diffuse sound Pdir 503 is reproduced as a point-type source, for example, by sound position whose gains are calculated in block 701 based on the arrival direction (505). On the other hand, the diffuse sound, Pdiff, goes through distinct decorrelators L (711 to 71L) . For each of the speaker L signals, the diffuse and direct sound passes are added before passing through the inverse filter bank (703). [000271] Figure 11 illustrates a synthesis module according to an alternative application. All quantities in the figure are considered in the time-frequency domain; the (k,n) notation was ignored for reasons of simplicity, eg Pi = Pi(k,n). To improve the audio quality of the reproduction in case of particularly complex sound scenes, for example several sources active at the same time, the synthesis module, for example synthesis module 104 can, for example, be carried out as shown in the figure. 11. 
Rather than selecting the most dominant IPLS to be reproduced coherently, the synthesis in Figure 11 performs a complete synthesis of each of the M layers separately. The i-th layer speaker L signals are output from block 502 and are denoted by 191j to 19Li. The speaker signal h-th 19h at the output of the first unit of the synthesis stage 501 is the sum of 19hi to 19hM. Please note that unlike figure 10b, the DOA estimation step in block 607 needs to be performed for each of the M layers. [000272] Figure 26 illustrates an apparatus 950 to generate a virtual data stream from the microphone according to an application. Apparatus 950 for generating a virtual microphone data stream comprises an apparatus 960 for generating an audio output signal from a virtual microphone according to one of the applications described above, e.g. according to Fig. 12 , and an apparatus 970 for generating an audio data stream in accordance with one of the applications described above, for example in accordance with Fig. 2b , wherein the audio data stream generated by apparatus 970 for generating an audio data stream is the virtual data stream from the microphone. [000273] The apparatus 960, for example, in figure 26 for audio output from a virtual microphone comprises an evaluator of the position of sound events and an information computing module as in figure 12. The evaluator of the position of sound events sound is adapted to estimate a sound source position indicating a sound source position in the environment, whereby the sound event position evaluator is adapted to estimate the sound source position based on a first direction information provided by the first real spatial microphone being located in a first position of the real microphone in the environment, and based on a second direction information provided by the second real spatial microphone being located in a second position of the real microphone in the environment. The information computing module is adapted to generate the audio output signal based on a recorded audio input signal based on the first position of the actual microphone and based on the calculated microphone position. [000274] Apparatus 960 for generating an audio output signal from a virtual microphone is arranged to provide the audio output signal to apparatus 970 for generating an audio data stream. Apparatus 970 for generating an audio data stream comprises a determiner, e.g., the determiner 210 described with respect to Fig. 2b. The determiner of the apparatus 970 for generating an audio data stream determines the sound source data based on the audio output signal provided by the apparatus 960 to generate an audio output signal from a virtual microphone. [000275] Figure 27 illustrates an apparatus 980 for generating at least one audio output signal based on an audio data stream according to one of the applications described above, for example, the apparatus according to claim , being configured to generate the audio output signal based on a virtual data stream from the microphone such as the audio data stream provided by an apparatus 950 to generate a virtual data stream from the microphone, e.g. figure 26. [000276] The apparatus 980 for generating a virtual microphone data stream inserts the virtual microphone signal generated in the apparatus 980 to generate at least one audio output signal based on an audio data stream. Note that the microphone's virtual data stream is an audio data stream. 
Apparatus 980 for generating at least one audio output signal based on an audio data stream generates an audio output signal based on the microphone's virtual data stream as an audio data stream, e.g. as described with respect to the apparatus of Figure 2a. [000277] Figure 1 illustrates an apparatus for generating a combined audio data stream according to an application. [000278] In an application, the apparatus comprises a demultiplexer 180 for obtaining a plurality of single-layer audio data streams, wherein the demultiplexer 180 is adapted to receive one or more input audio data streams, wherein each input audio data stream comprises one or more layers, wherein the demultiplexer 180 is adapted to demultiplex each of the input audio data streams having one or more layers into two or more demultiplexed audio data streams having exactly one layer, such that one or more audio data streams demultiplexed together comprise one or more layers of the input audio data stream, to obtain two or more of the single layer audio data streams. [000279] In another application, the apparatus comprises a demultiplexer 180 for obtaining a plurality of single-layer audio data streams, wherein the demultiplexer 180 is adapted to receive two or more input audio data streams, wherein each The input audio data stream comprises one or more layers, wherein the demultiplexer 180 is adapted to demultiplex each of the input audio data streams having two or more layers into two or more demultiplexed audio data streams having exactly one layer, so that the two or more audio data streams demultiplexed together comprise the two or more layers of the input audio data stream, to obtain two or more of the single layer audio data streams. [000280] Further, the apparatus comprises a combining module 190 for generating the combined audio data stream, having one or more layers, based on the plurality of single layer audio data streams. Each layer of the input audio data streams, the demultiplexed audio data streams, the single-layer data streams and the combined audio data stream comprises a pressure value of a pressure signal, a position value and a broadcast value as audio data, the audio data being set to a time-frequency position of a plurality of time-frequency positions. [000281] In an application, the apparatus can be adapted to insert one or more incoming audio data streams having exactly one layer directly to the combination module without inserting them to the demultiplexer, see dashed line 195. [000282] In some applications, the demultiplexer 180 is adapted to modify the pressure values of the demultiplexed audio data streams to equalize the volumes (e.g. noise) of the different sound scenes represented by the demultiplexed audio data streams. For example, if two audio data streams originate from two different recording environments, and the first is characterized by low volume (for example, due to sources that are far away from the microphones, or simply due to microphones with low sensitivity or low preamp gain) you can increase the volume of the first audio data stream by multiplying a scale to the pressure values of the first audio data stream. Similarly, it is possible to reduce the volume of the second audio data stream in a similar way. [000283] Figure 28 describes the inputs and outputs of an apparatus to generate a combined audio data stream according to another application. 
A number of audio data streams M, for example GAC streams M, and optionally a pressure signal p(t) and position q(t) of an artificial sound source to be injected, are input to the apparatus of Fig. 28 In another application, two or more artificial sound sources (synthetic sound sources) are inserted into the device. On output, an audio output stream, for example a GAC stream representing the modified sound scene, is returned. [000284] Analogously, an audio output stream, eg a GAC stream, can be directly generated from a monosound source (ie without any combination). [000285] The first type of input 1111, 1112, ..., 111M to the device are audio data streams, for example GAC M streams, where the i-th stream comprises Li layers, Qac[the i-th audio data stream layer comprises one or more pressure values of the complex pressure signal Pi, the position of the source , and the diffusion in a time-frequency domain. If a two-dimensional representation is used, the font position can be set to . It should be noted that all quantities depend on time and frequency indices (k, n). In a formulation, however, the dependence of time and frequency is not explicitly mentioned for a better readable formulation and for simplicity. [000286] Input 1120 is optional information being represented in a time domain, pressure and position of an artificial sound source to be inserted into the sound scene. The output 1140 from the apparatus of Fig. 28 is an audio data stream, e.g., a GAC stream having Lo layers. [000287] Figure 29 illustrates an apparatus for generating a combined audio data stream according to another application. In Fig. 29, the demultiplexer of Fig. 1 comprises a plurality of demultiplexing units. The apparatus of Fig. 29 comprises demultiplexing units (DEMUX) 1201, an artificial source generator (performing the audio stream, e.g. GAC stream, generating to an artificial source) 1202, and a combination module 1203. [000288] Referring to one of the demultiplexing units 1201, the demultiplexing unit with respect to the GAC stream i-th llli, which comprises L± layers, outputs Li separate single-layer GAC streams. Artificial source generator 1202 generates a single-layer GAC stream for the artificial sound source. [000289] The combination module 1203, which performs the combination, receives single-layer GAC streams N, where N is: M (1) [000290] Figure 30 describes a combination module 1203 according to an application. Single-layer audio data streams N, for example, single-layer GAC streams N, 1211 to 121N are combined, resulting in the audio data stream, for example, a GAC stream 1140, having Lo layers corresponding to the combination of the sound scenes, where Lo - N. [000291] The combination is inter alia, based on the following concept: for each time-frequency position, there are N IPLS active, each described by one of the GAC N flows. Considering, 0 eg energy and diffusion, the sources most prominent Lo are identified. The first Lo - 1 sources are simply reassigned to the first layers of the Lo - 1 combined audio data stream, eg the output GAC stream, where all remaining sources are added to the last layer, ie Loth . [000292] The apparatus of figure 30 comprises a charge function module 1401. The charge function module 1401 analyzes the pressure signals N and diffusion parameters N. The charge function module 1401 is configured to determine the sound sources more prominent for each time-frequency position. 
For example, the cost function fi for the flow i-th with ' J can be, for example, defined as [000293] so that a sound source, for example, an IPLS, with high energy and low diffusion results in high values of the cost function. The fj cost function. calculates a cost value. [000294] The output of the cost function module 1401 is the vector r of size Lo x 1, comprising the IPLS indices with the highest fi. Still, indices are ranked from most prominent IPLS to least. This information is passed to a position mixing unit 1403, a pressure combination unit 1404, and a diffusion combination unit 1405, where the resulting GAC flow parameters for each time-frequency position are calculated correctly. Applications such as calculating parameters are described in detail below. [000295] The apparatus of figure 30 further comprises a sound scene adaptation module 1402. The sound scene adaptation module 1402 allows additional control over the combining step, where the GAC position information is manipulated before the combination real. In this way, various combination schemes can be obtained, for example, combination with complete overlapping of the events in the separate scenes, combination with placing the sound scenes side by side, combination with certain restrictions on the amount of overlap etc. [000296] Figure 31a, Figure 31b and Figure 31c describe possible scenarios of the sound scene. Figure 31a shows two sound scenes with a speaker each. Vectors indicate a local coordinate system. After the combination, without any modification performed by the sound scene adaptation module 1402, a sound scene as described at the bottom of figure 31a will be obtained. This might be unwanted. By manipulating the coordinate system of one or more sound scenes, it is possible to arbitrarily compose the combined sound scene. In Fig. 31b, as an example, a rotation is introduced so that in the combined sound scenes the transmitters are separated. Translations (as shown in figure 31c) or non-linear transformations applied at positions Qi to QN are still possible. [000297] Position mixing unit 1403, pressure combination unit 1404, and diffusion combination unit 1405 are adapted to receive the parameter N flows as input and are adapted to calculate the parameters of the resulting GAC's Lo flows . [000298] Each of the parameters can be obtained as follows: a. Position mixing unit 1403 is adapted to determine the resulting position of the outgoing GAC stream. The position of the i-th source in the output stream Qi' corresponds to the position of the most prominent non-fuzzy input source i-th indicated by the vector r provided by the cost function module 1401. [000299] where r± indicates the i-th element of r. [000300] By determining the most prominent non-diffuse input sources L0-th as indicated by vector r, the position mixing unit 1403 determines a group comprising one or more single-layer audio data streams, where the value of The cost of each of the single-layer audio data streams in the group can be greater than the cost value of any single-layer audio data stream not comprised in the group. The position mixing unit 1403 is adapted to select/generate one or more position values from one or more layers of the combined audio data stream, so that each position value from each of the layer audio data streams group is a position value from one of the layers of the combined audio data stream. B. The resulting pressure for each of the flows is calculated by the pressure combination unit 1404. 
The pressure signal for all but the last flow GAC (L0-th) is equal to the corresponding pressure signal according to the input vector r . The GAC flow pressure L0-th is given as a linear combination of the pressures of each of the remaining pressure signals N - L0+1, for example [000301] By determining the most prominent non-diffuse input sources L0-1 -th as indicated by vector r, the pressure combination unit is adapted to determine a first group comprising one or more single-layer audio data streams of the plurality of single layer audio data streams and to determine a second group (the remaining input sources in vector r) comprising one or more different single layer audio data streams from the plurality of single layer audio data streams , where the cost value of each of the first group's single-layer audio data streams is greater than the cost value of each of the second group's single-layer audio data streams. The pressure combining unit is adapted to generate one or more pressure values from one or more layers of the combined audio data stream, such that each pressure value from each of the first single layer audio data streams group is a pressure value from one of the layers of the combined audio data stream, and so that a combination of the pressure values from the second group single layer audio data streams is a pressure value from one of the layers of the combined audio data stream. ç. The diffusion of the resulting GAC stream is calculated by the diffusion combination unit 1405. Similar to other parameters, the diffusion is copied from the input streams to all but the last GAC stream L0-th [000302] The L0-th diffusion parameters can, for example, be calculated considering that the pressure signal Lo comprises direct sound from more IPLS which will not be interpreted coherently, as only one position can be assigned. P' Thus, the amount of energy in Lo that corresponds to the direct sound is merely [000303] Consequently, the diffusion can be obtained by [000304] By determining the most prominent L0-1 -th non-fuzzy input sources as indicated by vector r, the spreading combination unit is adapted to determine a first group comprising one or more single layer audio data streams and to determining a second group (the remaining input sources in vector r) comprising one or more different single layer audio data streams from the plurality of single layer audio data streams, wherein the cost value of each of the streams The first group's single-layer audio data stream is greater than the cost value of each of the second group's single-layer audio data streams. The combination spread unit is adapted to generate one or more pressure values from one or more layers of the combined audio data stream, so that each spread value from each of the single layer audio data streams of the first group is a spread value from one of the layers of the combined audio data stream, and so that a combination of the spread values from the second group single layer audio data streams is a spread value from one of the layers of the combined audio data stream. [000305] Finally, the resulting Lo single layer GAC streams are multiplexed in block 1406 to form the final GAC stream (output 1140) of the Lo layers. [000306] In the following, artificial source generators according to applications are described in more detail with reference to figure 32a and figure 32b. 
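Before turning to the artificial source generator, the combination rules of paragraphs [000290] to [000305] can be summarized in a short sketch; the cost function and the energy bookkeeping for the last output layer are assumptions consistent with the qualitative description above (high energy and low diffusion make a source prominent), since the printed formulas are given only as figures:

```python
import numpy as np

def combine_single_layer_streams(streams, n_out):
    """Sketch of the combination module 1203 for one (k, n) bin.

    streams : list of N single-layer entries, each a dict with "P", "Q", "psi".
    n_out   : number of layers Lo of the combined output stream.

    Assumed cost function: f_i = (1 - psi_i) * |P_i|^2, and an assumed energy
    bookkeeping for the last output layer.
    """
    cost = [(1.0 - s["psi"]) * abs(s["P"]) ** 2 for s in streams]   # cost function module 1401
    order = list(np.argsort(cost)[::-1])                            # most prominent first (vector r)

    out = []
    # The Lo - 1 most prominent sources are copied to their own output layers.
    for idx in order[:n_out - 1]:
        out.append(dict(streams[idx]))

    # All remaining sources are merged into the last (Lo-th) output layer.
    rest = [streams[idx] for idx in order[n_out - 1:]]
    P_last = sum(s["P"] for s in rest)                              # pressure combination unit 1404
    Q_last = rest[0]["Q"]                                           # position of the most prominent remaining source (unit 1403)
    # Only the direct energy of the most prominent remaining source can be
    # reproduced coherently; everything else is counted as diffuse (unit 1405).
    e_total = sum(abs(s["P"]) ** 2 for s in rest)
    e_dir = (1.0 - rest[0]["psi"]) * abs(rest[0]["P"]) ** 2
    psi_last = 1.0 - e_dir / (e_total + 1e-12)
    out.append({"P": P_last, "Q": Q_last, "psi": psi_last})
    return out

# Example: combine three single-layer entries into a two-layer output frame.
frame = [{"P": 1.0 + 0.0j, "Q": (0.0, 1.0, 0.0), "psi": 0.1},
         {"P": 0.4 + 0.2j, "Q": (2.0, 0.0, 0.0), "psi": 0.3},
         {"P": 0.2 - 0.1j, "Q": (1.0, 2.0, 0.0), "psi": 0.8}]
combined = combine_single_layer_streams(frame, n_out=2)
```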
[000306] In the following, artificial source generators according to applications are described in more detail with reference to figure 32a and figure 32b. [000307] The artificial source generator is an optional module and uses as input 1120 a position and a pressure signal, expressed in the time domain, of an artificial sound source which is to be inserted into the sound scene. It then returns the GAC stream of the artificial source as output 121N. [000308] The information about the position of the source over time is given to the first processing block 1301. If the sound source is not moving, block 1301 simply copies the position to all time-frequency bins Q(k, n) at output 121N. For a moving source, the information in q(t) is copied to all frequency bins k corresponding to the correct time block n. The output of block 1301 is then passed directly as part of the GAC stream to block 1203. The pressure signal p(t) of the inserted source 1120 can be a. directly converted to the GAC stream pressure signal P(k, n) (see figure 32a), or b. first reverberated and then converted to the GAC stream pressure signal [000309] P(k, n) (see figure 32b). [000310] According to application a), illustrated in figure 32a, the signal is transformed into the frequency domain using the analysis filter bank in block 1302 and then passed as a parameter of the GAC stream corresponding to the input source. If the pressure signal p(t) is not dry, the signal can pass through the optional block 1303, where noise and/or ambience are detected. Block 1303 may implement a prior art algorithm for this purpose, as described in [000311] [30] C. Uhle and C. Paul: A supervised learning approach to ambience extraction from mono recordings for blind upmixing, in Proc. of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008. [000312] The information about the noise and ambience is then passed to block 1304, which calculates the diffusion estimate. This is particularly useful to prevent the ambience and noise comprised in p(t) from being reproduced coherently in the synthesis. Thus, the mechanism just described ensures that the direct part of the signal is assigned a low diffusion value, whereas the noisy and ambient parts of the signal are associated with high diffusion. As an alternative to passing the signal through blocks 1303 and 1304, the diffusion parameter at 121N can simply be set to a constant value. [000313] Application b), illustrated in figure 32b, covers in some sense the opposite situation. Assuming that p(t) is a dry signal, it may be desirable to add reverberation in order to make p(t) sound more natural, that is, to make the synthetic sound source sound as if it had been recorded in a room. This is achieved by means of block 1305. Both the reverberated signal and the original signal are transformed by the analysis filter bank 1302 and are passed to the power ratio analysis block 1306. Block 1306 computes information on whether reverberation or direct sound is present at a certain time-frequency bin, for example by calculating the Direct to Reverberation Ratio (DRR). This information is then passed to block 1304, in which the diffusion is calculated. [000314] For high DRR the diffusion parameter is set to low values, whereas when the reverberation dominates (for example, in the reverberation tails) the diffusion is set to high values.
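To illustrate the artificial source generator of figures 32a and 32b, the following Python sketch converts a time-domain pressure signal of a non-moving source into a single-layer GAC stream. The use of an STFT as the analysis filter bank, the convolution-based reverberation and the mapping from the power ratio to the diffusion are assumptions made for this illustration.

```python
import numpy as np
from scipy.signal import stft

def artificial_source_stream(p, q, fs, reverb_ir=None, nperseg=1024):
    """Sketch of the artificial source generator (figures 32a/32b) for a
    non-moving source at position q with time-domain pressure signal p.

    Returns a single-layer GAC stream: P(k, n), Q(k, n) and psi(k, n).
    """
    _, _, P_dry = stft(p, fs=fs, nperseg=nperseg)   # analysis filter bank (block 1302)

    if reverb_ir is None:
        # Application a): direct conversion; here the diffusion is simply
        # set to a constant value instead of using blocks 1303/1304.
        P = P_dry
        psi = np.zeros(P.shape, dtype=float)
    else:
        # Application b): reverberate first (block 1305), then estimate the
        # diffusion from an assumed DRR-like power ratio (blocks 1306/1304).
        p_rev = np.convolve(p, reverb_ir)[: len(p)]
        _, _, P_rev = stft(p_rev, fs=fs, nperseg=nperseg)
        P = P_rev
        drr = np.abs(P_dry) ** 2 / (np.abs(P_rev - P_dry) ** 2 + 1e-12)
        psi = 1.0 / (1.0 + drr)                     # high DRR -> low diffusion

    # Block 1301: copy the source position to all time-frequency bins.
    Q = np.tile(np.asarray(q, dtype=float).reshape(3, 1, 1), (1,) + P.shape)
    return P, Q, psi
```

For a moving source, the position q(t) would be copied per time block n instead of once for the whole stream, as described above.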
[000315] Below, some special cases are described. [000316] If M single-layer GAC streams are to be combined into a single GAC stream (Lo = 1), a simplified application can be employed. The resulting GAC stream will then be characterized by: - pressure: the pressure will be the sum of all M pressure signals; - position: the position will be the position of the strongest sound source, for example the strongest IPLS; - diffusion: the diffusion will be calculated according to formula (5). [000317] If the number of layers in the output is equal to the total number of layers in the input, that is, Lo = N, then the output stream can be seen as a concatenation of the input streams. [000318] Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or to a feature of a method step. Similarly, aspects described in the context of a method step also represent a description of a corresponding unit or item or feature of a corresponding apparatus. [000319] The inventive decomposed signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet. [000320] Depending on certain implementation requirements, applications of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are able to cooperate) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable. [000321] Some applications according to the invention comprise a data carrier with electronically readable control signals that are capable of cooperating with a programmable computer system, so that one of the methods described in this document is performed. [000322] Generally, applications of the present invention can be implemented as a computer program product with a program code, the program code being operative to perform one of the methods when the computer program product runs on a computer. The program code can, for example, be stored on a machine readable carrier. [000323] Other applications comprise the computer program for performing one of the methods described in this document, stored on a machine readable carrier. [000324] In other words, an application of the inventive method is, therefore, a computer program with a program code for performing one of the methods described in this document, when the computer program runs on a computer. [000325] A further application of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer readable medium) comprising, recorded thereon, a computer program for performing one of the methods described in this document. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory. [000326] A further application of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described in this document. The data stream or the sequence of signals can, for example, be configured to be transferred via a data communication connection, for example via the Internet. [000327] A further application comprises a processing means, for example a computer or a programmable logic device, configured or adapted to perform one of the methods described in this document.
[000328] A further application comprises a computer having installed thereon the computer program for performing one of the methods described in this document. [000329] In some applications, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described in this document. In some applications, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described in this document. Generally, the methods are preferably performed by any hardware apparatus. [000330] The applications described above are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and details described in this document will be apparent to others skilled in the art. It is, therefore, intended that the invention be limited only by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the applications in this document. LITERATURE: [000331] [1] Michael A. Gerzon. Ambisonics in multichannel broadcasting and video. J. Audio Eng. Soc, 33(11):859-871, 1985. [000332] [2] V. Pulkki, "Directional audio coding in spatial sound reproduction and stereo upmixing," in Proceedings of the AES 28th International Conference, pp. 251-258, Piteå, Sweden, June 30 - July 2, 2006. [000333] [3] V. Pulkki, "Spatial sound reproduction with directional audio coding," J. Audio Eng. Soc., vol. 55, no. 6, pp. 503-516, June 2007. [000334] [4] C. Faller: "Microphone Front-Ends for Spatial Audio Coders", in Proceedings of the AES 125th International Convention, San Francisco, Oct. 2008. [000335] [5] M. Kallinger, H. Ochsenfeld, G. Del Galdo, F. Küch, D. Mahne, R. Schultz-Amling, and O. Thiergart, "A spatial filtering approach for directional audio coding," in Audio Engineering Society Convention 126, Munich, Germany, May 2009. [000336] [6] R. Schultz-Amling, F. Küch, O. Thiergart, and M. Kallinger, "Acoustical zooming based on a parametric sound field representation," in Audio Engineering Society Convention 128, London UK, May 2010. [000337] [7] J. Herre, C. Falch, D. Mahne, G. Del Galdo, M. Kallinger, and O. Thiergart, "Interactive teleconferencing combining spatial audio object coding and DirAC technology," in Audio Engineering Society Convention 128, London UK, May 2010. [000338] [8] E.G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography, Academic Press, 1999. [000339] [9] A. Kuntz and R. Rabenstein, "Limitations on the extrapolation of wave fields from circular measurements," in 15th European Signal Processing Conference (EUSIPCO 2007), 2007. [000340] [10] A. Walther and C. Faller, "Linear simulation of spaced microphone arrays using b-format recordings," in Audio Engineering Society Convention 128, London UK, May 2010. [000341] [11] US 61/287,596: An Apparatus and a Method for Converting a First Parametric Spatial Audio Signal into a Second Parametric Spatial Audio Signal. [000342] [12] S. Rickard and Z. Yilmaz, "On the approximate W-disjoint orthogonality of speech," in Acoustics, Speech and Signal Processing, 2002. ICASSP 2002. IEEE International Conference on, April 2002, vol. 1. [000343] [13] R. Roy, A. Paulraj, and T. Kailath, "Direction-of-arrival estimation by subspace rotation methods - ESPRIT," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Stanford, CA, USA, April 1986.
Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986. [000345] [15] J. Michael Steele, "Optimal Triangulation of Random Samples in the Plane", The Annals of Probability, Vol. 10, No.3 (Aug., 1982), pp. 548-553. [000346] [16] F.J. Fahy, Sound Intensity, Essex: Elsevier Science Publishers Ltd., 1989. [000347] [17] R. Schultz-Amling, F. Kiich, M. Kallinger, G. Del Galdo, T. Ahonen and V. Pulkki, "Planar microphone array processing for the analysis and reproduction of spatial audio using directional audio coding ," in Audio Engineering Society Convention 124, Amsterdam, The Netherlands, May 2008. [000348] [18] M. Kallinger, F. Küch, R. Schultz-Amling, G. Del Galdo, T. Ahonen and V. Pulkki, "Enhanced direction estimation using microphone arrays for directional audio coding;" in Hands-Free Speech Communication and Microphone Arrays, 2008. HSCMA 2008, May 2008, pp. 45-48. [000349] [19] R.K. Furness, "Ambisonics - An overview," in AES 8th International Conference, April 1990, pp. 181-189. [000350] [20] Giovanni Del Galdo, Oliver Thiergart, TobiasWeller, and E.A.P. Habets. Generating virtual microphone signals using geometrical information gathered by distributed arrays. In Third Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA 'll), Edinburgh, United Kingdom, May 2011. [000351] [21] Jurgen Herre, Cornelia Falch, Dirk Mahne, Giovanni Del Galdo, Markus Kallinger, and Oliver Thiergart. Interactive teleconferencing combining spatial audio object coding and DirAC technology. In Audio Engineering Society Convention 128, 5 2010. [000352] [22] G. Del Galdo, F. Kuech, M. Kallinger, and R. Schultz-Amling. Efficient merging of multiple audio streams for spatial sound reproduction in directional audio coding. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2009), 2009. [000353] [23] US 20110216908: Apparatus for Merging Spatial Audio Streams. [000354] [24] Emmanuel Gallo and Nicolas Tsingos. Extracting and re-rendering structured auditory scenes from field recordings. In AES 30th International Conference on Intelligent Audio Environments, 2007. [000355] [25] Jeroen Breebaart, Jonas Engdegârd, Cornelia Falch, Oliver Hellmuth, Johannes Hilpert, Andreas Hoelzer, Jeroesn Koppens, Werner Oomen, Barbara Resch, Erik Schuijers, and Leonid Terentiev. Spatial audio object coding (saoc) - the upcoming mpeg standard on parametric object based audio coding. In Audio Engineering Society Convention 124, 5 2008. [000356] [26] R. Roy and T. Kailath. ESPRIT-estimation of signal parameters via rotational invariance techniques. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(7):984-995, July 1989. [000357] [27] Tapio Lokki, Juha Merimaa, and Ville Pulkki. Method for reproducing natural or modified spatial impression in multichannel listening, 2006. [000358] [28] Svein Merge. Device and method for converting spatial audio signal. US patent application, Appl. No. 10/547,151. [000359] [29] Ville Pulkki. Spatial sound reproduction with directional audio coding. J. Audio Eng. Soc, 55(6):503-516, June 2007. [000360] [30] C. Uhle and C. Paul: A supervised learning approach to ambience extraction from mono recordings for blind upmixing in Proc, of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008.
Claims (17) [0001] 1. An apparatus for generating a merged audio data stream, wherein the apparatus is implemented using a hardware apparatus or a computer, wherein the apparatus comprises: a demultiplexer for acquiring a plurality of single-layer audio data streams, characterized in that the demultiplexer is adapted to receive one or more input audio data streams, each input audio data stream comprising one or more layers, wherein the demultiplexer is adapted to demultiplex each of the input audio data streams comprising one or more layers into two or more demultiplexed audio data streams comprising exactly one layer, such that the two or more demultiplexed audio data streams together comprise the one or more layers of the input audio data stream, to acquire two or more of the single-layer audio data streams; and a merge module for generating the merged audio data stream, comprising one or more layers, based on the plurality of single-layer audio data streams, wherein each layer of the input audio data streams, of the demultiplexed audio data streams, of the single-layer audio data streams and of the merged audio data stream comprises a pressure value of a pressure signal, a position value and a spread value as audio data, wherein the position value indicates a position of a sound source. [0002] 2. Apparatus according to claim 1, characterized in that the audio data is defined for a time-frequency bin of a plurality of time-frequency bins. [0003] 3. Apparatus according to claim 2, characterized in that the merge module further comprises a pressure merge unit, wherein the pressure merge unit is adapted to determine a first group comprising one or more single-layer audio data streams of the plurality of single-layer audio data streams and to determine a second group comprising one or more different single-layer audio data streams of the plurality of single-layer audio data streams, wherein a cost value of each of the single-layer audio data streams of the first group is greater than a cost value of each of the single-layer audio data streams of the second group, or wherein the cost value of each of the single-layer audio data streams of the first group is less than the cost value of each of the single-layer audio data streams of the second group, wherein the pressure merge unit is adapted to generate one or more pressure values of one or more layers of the merged audio data stream, such that each pressure value of each of the single-layer audio data streams of the first group is a pressure value of one of the layers of the merged audio data stream, and such that a combination of the pressure values of the single-layer audio data streams of the second group is a pressure value of one of the layers of the merged audio data stream. [0004] 4.
Apparatus according to claim 2, characterized in that the merge module further comprises a diffusion merge unit, wherein the diffusion merge unit is adapted to determine a third group comprising one or more single-layer audio data streams of the plurality of single-layer audio data streams and to determine a fourth group comprising one or more different single-layer audio data streams of the plurality of single-layer audio data streams, wherein a cost value of each of the single-layer audio data streams of the third group is greater than a cost value of each of the single-layer audio data streams of the fourth group, or wherein the cost value of each of the single-layer audio data streams of the third group is less than the cost value of each of the single-layer audio data streams of the fourth group, wherein the diffusion merge unit is adapted to generate one or more spread values of one or more layers of the merged audio data stream, such that each spread value of each of the single-layer audio data streams of the third group is a spread value of one of the layers of the merged audio data stream, and such that a combination of the spread values of the single-layer audio data streams of the fourth group is a spread value of one of the layers of the merged audio data stream. [0005] 5. Apparatus according to claim 2, characterized in that the merge module further comprises a position mixing unit, wherein the position mixing unit is adapted to determine a fifth group comprising one or more single-layer audio data streams of the plurality of single-layer audio data streams, wherein a cost value of each of the single-layer audio data streams of the fifth group is greater than a cost value of any single-layer audio data stream of the plurality of single-layer audio data streams not comprised in the fifth group, or wherein the cost value of each of the single-layer audio data streams of the fifth group is less than the cost value of any single-layer audio data stream of the plurality of single-layer audio data streams not comprised in the fifth group, wherein the position mixing unit is adapted to generate one or more position values of one or more layers of the merged audio data stream, such that each position value of each of the single-layer audio data streams of the fifth group is a position value of one of the layers of the merged audio data stream. [0006] 6. Apparatus according to claim 2, characterized in that the merge module further comprises a sound scene adaptation module for manipulating the position value of one or more of the single-layer audio data streams of the plurality of single-layer audio data streams. [0007] 7. Apparatus according to claim 6, characterized in that the sound scene adaptation module is adapted to manipulate the position value of one or more of the single-layer audio data streams of the plurality of single-layer audio data streams by applying a rotation, a translation or a non-linear transformation to the position value. [0008] 8. Apparatus according to claim 1, characterized in that the merge module comprises a cost function module for assigning a cost value to each of the single-layer audio data streams, and wherein the merge module is adapted to generate the merged audio data stream based on the cost values assigned to the single-layer audio data streams. [0009] 9.
Apparatus according to claim 8, characterized in that the cost function module is adapted to assign the cost value to each of the single-layer audio data streams depending on at least one of the pressure values or the diffusion values of the single-layer audio data stream. [0010] 10. Apparatus according to claim 9, characterized in that the cost function module is adapted to assign the cost value to each audio data stream of the group of single-layer audio data streams by applying the formula: [0011] 11. Apparatus according to claim 1, characterized in that the demultiplexer is adapted to modify a magnitude of one of the pressure values of one of the demultiplexed audio data streams by multiplying the magnitude by a scalar value. [0012] 12. Apparatus according to claim 1, characterized in that the demultiplexer comprises a plurality of demultiplexing units, wherein each of the demultiplexing units is configured to demultiplex one or more of the input audio data streams. [0013] 13. Apparatus according to claim 1, characterized in that the apparatus further comprises an artificial source generator for generating an artificial data stream comprising exactly one layer, wherein the artificial source generator is adapted to receive pressure information represented in a time domain and to receive position information, wherein the artificial source generator is adapted to replicate the position information to generate position information for a plurality of time-frequency bins, and wherein the artificial source generator is further adapted to calculate diffusion information based on the pressure information. [0014] 14. Apparatus according to claim 13, characterized in that the artificial source generator is adapted to transform the pressure information represented in a time domain into a time-frequency domain. [0015] 15. Apparatus according to claim 13, characterized in that the artificial source generator is adapted to add reverberation to the pressure information. [0016] 16. Method for generating a merged audio data stream, characterized in that it comprises: acquiring a plurality of single-layer audio data streams, wherein a demultiplexer is adapted to receive one or more input audio data streams, wherein each input audio data stream comprises one or more layers, wherein the demultiplexer is adapted to demultiplex each of the input audio data streams comprising one or more layers into two or more demultiplexed audio data streams comprising exactly one layer, such that the two or more demultiplexed audio data streams together comprise the one or more layers of the input audio data stream, to acquire two or more of the single-layer audio data streams; and generating the merged audio data stream, comprising one or more layers, based on the plurality of single-layer audio data streams, wherein each layer of the input audio data streams, of the demultiplexed audio data streams, of the single-layer audio data streams and of the merged audio data stream comprises a pressure value of a pressure signal, a position value and a spread value as audio data, the audio data being defined for a time-frequency bin of a plurality of time-frequency bins, wherein the position value indicates a position of a sound source. [0017] 17.
Non-transitory digital storage medium comprising a computer program for implementing, when executed on a computer or signal processor, the method for generating a merged audio data stream, the method characterized in that it comprises: acquiring a plurality of single-layer audio data streams, wherein a demultiplexer is adapted to receive one or more input audio data streams, wherein each input audio data stream comprises one or more layers, wherein the demultiplexer is adapted to demultiplex each of the input audio data streams comprising one or more layers into two or more demultiplexed audio data streams comprising exactly one layer, such that the two or more demultiplexed audio data streams together comprise the one or more layers of the input audio data stream, to acquire two or more of the single-layer audio data streams; and generating the merged audio data stream, comprising one or more layers, based on the plurality of single-layer audio data streams, wherein each layer of the input audio data streams, of the demultiplexed audio data streams, of the single-layer audio data streams and of the merged audio data stream comprises a pressure value of a pressure signal, a position value and a spread value as audio data, the audio data being defined for a time-frequency bin of a plurality of time-frequency bins, wherein the position value indicates a position of a sound source.